Extracting information from reports using Regular Expressions Library in Python

Yogesh Last Updated : 29 Oct, 2024

Introduction

It is often necessary to extract key information from reports, articles, papers, etc. Examples include company names and prices from financial reports, names of judges and jurisdictions from court judgments, and account numbers from customer complaints.

These extractions are part of text mining and are essential for converting unstructured data into a structured form, which can later be used for analytics or machine learning.

Such entity extraction uses approaches such as ‘lookup’, ‘rules’ and ‘statistical/machine learning’. In the ‘lookup’ approach, words from the input documents are searched against a pre-defined data dictionary. In the ‘rules’ approach, pattern searches are used to find key information. In the ‘statistical’ approach, supervised or unsupervised methods are used to extract the information.

‘Regular expressions (RegEx)’ are one of the ‘rules’ based pattern search methods.

Basic syntax

Python supports regular expressions through the built-in “re” library (though it is not fully Perl-compatible). Search patterns are usually written as raw strings (prefixed with “r”) rather than regular strings, so that backslashes are not interpreted by Python but passed to the RegEx engine unchanged.
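
To make the point about raw strings concrete, here is a minimal sketch comparing a regular string and a raw string holding what looks like the same pattern. The variable names are illustrative only.

```python
import re

# In a regular string literal, "\b" is the backspace character (one character).
normal = "\bcat\b"
# In a raw string literal the backslash is kept, so the RegEx engine receives \b,
# which it reads as a word boundary.
raw = r"\bcat\b"

text = "the cat sat"

print(len(normal), len(raw))           # the raw version is two characters longer
print(re.search(normal, text))         # None: it looks for backspace characters
print(re.search(raw, text).group(0))   # the whole word "cat"
```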

Go through the following table of basic syntax.

Pattern	Meaning	Pattern	Meaning
abc…	Letters	{m}	m Repetitions
123…	Digits	{m,n}	m to n Repetitions
\d	Any Digit	*	Zero or more repetitions
\D	Any Non-digit character	+	One or more repetitions
.	Any Character	?	Optional character
\.	Period	\s	Any Whitespace
[abc]	Only a, b, or c	\S	Any Non-whitespace character
[^abc]	Not a, b, nor c	^…$	Starts and ends
[a-z]	Characters a to z	(…)	Capture Group
[0-9]	Numbers 0 to 9	(a(bc))	Capture Sub-group
\w	Any Alphanumeric character or underscore	(.*)	Capture all
\W	Any character other than alphanumerics and underscore	(abc|def)	Matches abc or def

Go through the following Python sample code for usage of RegEx.

Python Code:

import re

regex = r"([a-zA-Z]+) (\d+)"
text = "June 24"
match = re.search(regex, text)
if match:
    # match.group() or match.group(0) always returns the fully matched string, i.e. "June 24"
    print("Match: %s" % match.group(0))
    # match.group(1), match.group(2), ... return the capture groups in order
    print("Month: %s" % match.group(1))  # "June"
    print("Day: %s" % match.group(2))    # "24"
else:
    # If re.search() does not find a match, it returns None
    print("The regex pattern does not match. :(")

“re.search()” returns only the first match, as a match object. To get every non-overlapping match, use “re.findall()”, which returns a list of the matched strings (or of the captured groups, when the pattern contains groups). “re.sub()” replaces each match of the search pattern with a replacement string. When the same pattern is used many times, it can help to compile it once with “re.compile()” and reuse the resulting RegEx object for searching, as shown below.

regex = re.compile(r"(\w+) Lamb") 
text = "Mary had a little Lamb" 
result = regex.search(text)
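
The differences between searching, finding all matches and substituting are easiest to see side by side. The following is a small sketch using a made-up sentence and the month/day pattern from the earlier sample; the "<DATE>" placeholder is just an illustration.

```python
import re

text = "Invoices dated June 24 and July 4 are due by August 15."
date_pattern = re.compile(r"([a-zA-Z]+) (\d+)")

# search() stops at the first match and returns a match object (or None)
first = date_pattern.search(text)
print(first.group(0))

# findall() returns every match; with two capture groups each item is a tuple
all_pairs = date_pattern.findall(text)
print(all_pairs)

# sub() replaces every match with the replacement string
redacted = date_pattern.sub("<DATE>", text)
print(redacted)
```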

More information about RegEx usage in Python can be found at Regex One and in this AV article.

Use Cases

Imagine writing code to search a document for telephone numbers like +91-9890251406, with multiple variations in format. With validations, such code will typically run to more than 10 lines. With RegEx, it takes about two or three lines of code and remains highly customizable.

The following are some frequently occurring scenarios where RegEx can offer substantial help. Please note that the same results could be obtained in other ways, especially by using meta characters such as “\d” in place of “[0-9]” for digits. Most of the examples deliberately use explicit, simple patterns for clarity.

Finding email

“^[a-zA-Z0-9_\-]+@[a-zA-Z0-9_\-]+\.[a-zA-Z0-9_\-]+”	Simple email address (name, “@”, domain, “.”, suffix)
“@\w+\.\w+”	Domain within email
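
As a rough illustration of how such a pattern is used on running text, the sketch below pulls email-like strings out of a sentence. The pattern is a deliberately simple assumption (it also allows dots and plus signs in the name and several dot-separated domain parts); it does not try to cover every valid address, and the addresses are made up.

```python
import re

# Simplified pattern: name, "@", then one or more dot-separated domain parts
email_pattern = re.compile(r"[a-zA-Z0-9_.+\-]+@[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)+")

text = "Write to sales@example.com or first.last@mail.example.org."

emails = email_pattern.findall(text)
# The domain is everything after the "@"
domains = [address.split("@", 1)[1] for address in emails]

print(emails)
print(domains)
```

Because no “^” anchor is used, the pattern can match anywhere in the text rather than only at the start of the string.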

Finding telephone number

“([0-9]{3}-){2}[0-9]{4}[^0-9]*$”	xxx-xxx-xxxx
“([0-9]{3}\.){2}[0-9]{4}[^0-9]*$”	xxx.xxx.xxxx
“[0-9]{10}[^0-9]*$”	xxxxxxxxxx
“\([0-9]{3}\)[0-9]{3}-[0-9]{4}[^0-9]*$”	(xxx)xxx-xxxx

The first two cases differ only in “-” or “.” and thus can be combined using “(-|\.)”.
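
A minimal sketch of that combination follows. It uses the non-capturing form “(?:-|\.)” and “fullmatch” so that each candidate string is checked as a whole; the candidate strings are illustrative.

```python
import re

# One pattern for both xxx-xxx-xxxx and xxx.xxx.xxxx
phone_pattern = re.compile(r"(?:[0-9]{3}(?:-|\.)){2}[0-9]{4}")

candidates = ["202-555-9355", "555.867.5309", "555/867-5309", "2025559355"]
results = {c: bool(phone_pattern.fullmatch(c)) for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```

Note that this combined form also accepts mixed separators such as 555-867.5309; whether that is acceptable depends on the data.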

A sample code for more elaborate phone number is as follows:

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code (optional)
    (\s|-|\.)?                      # separator (optional)
    (\d{3})                         # first 3 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext\.?|x)\s*(\d{2,5}))?    # extension (optional)
    )''', re.VERBOSE)
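
The sketch below shows one way such a verbose pattern might be applied to text. It redefines a copy of the pattern (here called phone_regex, with the extension keyword written as “ext\.?|x”) so that it runs on its own; the sample sentence and numbers are made up.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code (optional)
    (\s|-|\.)?                      # separator (optional)
    (\d{3})                         # first 3 digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last 4 digits
    (\s*(ext\.?|x)\s*(\d{2,5}))?    # extension (optional)
    )''', re.VERBOSE)

text = "Call 415-555-1011 or (415) 555-9999 ext 123 tomorrow."

# The pattern has many groups, so findall() would return tuples;
# finditer() with group(0) gives the full matched strings instead.
numbers = [m.group(0) for m in phone_regex.finditer(text)]
print(numbers)
```

Because the pattern is not anchored and the area code is optional, it can also match part of a longer run of digits, so stricter boundaries may be needed for real data.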

Finding date

“\d{2}-\d{2}-\d{4}”	xx-xx-xxxx without checking digits
“([2-9]|1[0-2]?)-(0[1-9]|[12][0-9]|3[01])-[1-2][90][0-9]{2}[^0-9]*$”	xx-xx-xxxx with stricter usage
“([2-9]|1[0-2]?)/(0[1-9]|[12][0-9]|3[01])/[1-2][90][0-9]{2}”	xx/xx/xxxx
“[1-2][90][0-9]{2}/([2-9]|1[0-2]?)/(0[1-9]|[12][0-9]|3[01])”	xxxx/xx/xx

The date patterns used above are purely numeric. There are other formats such as ‘27-Mar-1973’ or ‘27 March 1973’. I leave these as an open quiz and would like the panelists to think about their own RegEx patterns!
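
For the numeric formats, a pattern can only check the shape of a date, not whether the date exists. One possible approach, sketched here under the assumption that the strings are in day-month-year order, is to let RegEx find candidates and let “datetime.strptime” reject impossible ones. The function name and sample sentence are illustrative.

```python
import re
from datetime import datetime

# Shape check only: two digits, two digits, four digits
date_pattern = re.compile(r"\b(\d{2})-(\d{2})-(\d{4})\b")

def find_valid_dates(text):
    """Return date objects for every dd-mm-yyyy string that is a real date."""
    dates = []
    for match in date_pattern.finditer(text):
        try:
            dates.append(datetime.strptime(match.group(0), "%d-%m-%Y").date())
        except ValueError:
            # Has the right shape but is not a calendar date, e.g. 31-02-2020
            continue
    return dates

sample = "Filed on 27-03-1973, amended 31-02-2020, closed 05-11-1999."
valid_dates = find_valid_dates(sample)
print(valid_dates)
```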

Finding account /credit card number

“([0-9]{4}-){3}[0-9]{4}”	xxxx-xxxx-xxxx-xxxx
“[0-9]{16}”	xxxxxxxxxxxxxxxx
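
A common use of such patterns is masking account numbers in text. The sketch below combines the two patterns above, adds word boundaries, and keeps only the last four digits. The helper name and the card numbers are made up, and matching the format alone does not prove that a number is a valid card number.

```python
import re

# Either xxxx-xxxx-xxxx-xxxx or sixteen consecutive digits
card_pattern = re.compile(r"\b(?:[0-9]{4}-){3}[0-9]{4}\b|\b[0-9]{16}\b")

def mask_card(match):
    """Replace all but the last four digits of a matched number with asterisks."""
    digits = match.group(0).replace("-", "")
    return "*" * 12 + digits[-4:]

text = "Card 1234-5678-9012-3456 was replaced by 9876543210987654."
masked = card_pattern.sub(mask_card, text)
print(masked)
```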

Adding linked Information

Citations in research papers or judgments follow pre-defined formats and refer to an external document. The citation text can be replaced with a hyperlink to that document.

For example, a US legal citation looks like “17 U.S.C. § 107”. The pattern is: a number, a space, the text “U.S.C.”, another space, a § mark, another space, a set of numbers and, optionally, a year in parentheses. It can be replaced with a “<a href …/a>” hyperlink to the actual document it refers to.
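
The replacement described above can be sketched with “re.sub” and a replacement function. The base URL and its path layout are hypothetical placeholders, not a real service, and the sentence is made up.

```python
import re

# number, "U.S.C.", section mark, section number, optional year in parentheses
citation_pattern = re.compile(r"(\d+) U\.S\.C\. § (\d+)(?: \((\d{4})\))?")

BASE_URL = "https://example.org/uscode"  # hypothetical link target

def link_citation(match):
    """Wrap the matched citation text in an <a> tag pointing at BASE_URL."""
    title, section = match.group(1), match.group(2)
    return '<a href="%s/%s/%s">%s</a>' % (BASE_URL, title, section, match.group(0))

text = "See 17 U.S.C. § 107 (2012) for details."
linked = citation_pattern.sub(link_citation, text)
print(linked)
```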

Tools for development, testing and debugging

Although RegEx is powerful, it can get complicated for non-trivial tasks. It is even more challenging when you must understand (and debug) a RegEx written by someone else!

There are quite a few friendly utilities that help in developing and testing RegEx. Try Regex 101, which lets you enter your own text and try RegEx patterns against it.

Here is a sample text on which to try the phone number, account number and date patterns mentioned before.

+919890251406 01110100 555.867.5309 01101000
9890251406 1.4142135623 01101001 27/3/1973 987-01-6661
01110011 202.555.9355 00100000 01101001 91-020-25898963
912025898963 3.1415926535897932384626433832795 666-12-4895
01100001 202-555-9355 27-03-1973 00100000
01101000 (555) 867-5309 27-Mar-1973 2.718281828459 555-867-5309
01100101 01110011 01110011 555/867-5309
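
The same experiment can be run in Python. The sketch below applies two simplified patterns (assumptions for illustration, not the exact patterns from the tables above) to the sample text, with the line breaks joined into one string. The lookarounds “(?<!\d)” and “(?!\d)” keep the patterns from matching inside longer runs of digits.

```python
import re

sample_text = (
    "+919890251406 01110100 555.867.5309 01101000 "
    "9890251406 1.4142135623 01101001 27/3/1973 987-01-6661 "
    "01110011 202.555.9355 00100000 01101001 91-020-25898963 "
    "912025898963 3.1415926535897932384626433832795 666-12-4895 "
    "01100001 202-555-9355 27-03-1973 00100000 "
    "01101000 (555) 867-5309 27-Mar-1973 2.718281828459 555-867-5309 "
    "01100101 01110011 01110011 555/867-5309"
)

# xxx-xxx-xxxx or xxx.xxx.xxxx, not embedded in a longer number
phone_pattern = re.compile(r"(?<!\d)\d{3}[-.]\d{3}[-.]\d{4}(?!\d)")
# d-m-yyyy or d/m/yyyy with one- or two-digit day and month
date_pattern = re.compile(r"(?<!\d)\d{1,2}[-/]\d{1,2}[-/]\d{4}(?!\d)")

phones = phone_pattern.findall(sample_text)
dates = date_pattern.findall(sample_text)

print(phones)
print(dates)
```

These simple patterns miss “(555) 867-5309”, “555/867-5309” and “27-Mar-1973”, which is a useful starting point for refining them.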

Sites like RegExper give a visual representation of a RegEx search pattern for better understanding; the phoneRegex pattern shown earlier, used for phone numbers, is a good one to visualize.

Once a RegEx gives acceptable matches, the pattern can be used in programs. With enough practice, one can code the search patterns directly in the program itself.

All the RegEx patterns used here can, with some minor modifications, be used in programming languages like Python, Perl, Java, etc. They can also be used for Find-Replace in some popular text editors, like Microsoft Word (keep the “Use Wildcards” option ON) and OpenOffice, and in IDEs like PyCharm.

End Note

RegEx is a versatile, portable and powerful way of extracting key information from textual data. Mastery of it can help automate many mundane tasks. Although it can at times get complicated and hard to develop and debug, its immense capabilities have made it a must-have weapon in every programmer’s armoury, especially for data scientists working on text analytics.

Let me conclude with some food for thought: Can RegEx be used to solve a crossword puzzle?

Drop your answers below. If you have questions feel free to post them in the comments section.

By Analytics Vidhya Team: This article was contributed by Yogesh Kulkarni who is the second rank holder of Blogathon 3.

Yogesh

Yogesh H. Kulkarni is currently pursuing full-time PhD in the field of Geometric modeling, after working in the same domain for more than 16 years. He is also keenly interested in data sciences, especially Natural Language Processing, Machine Learning and wishes to pursue further career in these fields.

Pratima Joshi

Nice article Yogesh. Yes, I think RegEx can be used to solve a crossword puzzle. How, I don't know :)

furas

Your regex can't find some usual emails: [email protected] [email protected] [email protected] There are also less usual but valid emails - some examples in wikipedia (https://en.wikipedia.org/wiki/Email_address#Examples) But there can be so many unusual and valid emails so there is no universal regex which can find all of them :/

frankz

nice article. but there is a bug in phone regex. phoneRegex.search('(123)123456-7890') would match, which is wrong. Also email doesn't accept this type email 'a@c.d', which it should.

frankz

the email address is this "a&ltb&[email protected]"
