Create Your Own NLP Search Engine With BM25

Sudeep Last Updated : 30 Jan, 2025

5 min read

Search engines like Google and Yahoo work through crawling, indexing, and ranking methods like BM25. Crawling is when automated bots find new or updated pages and store key details like URLs, titles, and keywords. Indexing analyzes this data, identifying key content, images, and videos to store for future searches. BM25, a ranking algorithm, then retrieves the most relevant results based on how well they match the query keywords. When you search, engines don’t scan the entire internet but retrieve results from their indexed data. Today, we’ll build a small prototype that mimics the indexing and ranking steps of a search engine on a set of tweets.

This article was published as a part of the Data Science Blogathon.

Importing packages
What is BM25?
Preparing your tweets
Tokenizing tweets and running BM25
Top Five associated Tweets
Additional use cases of BM25
Frequently Asked Questions

Importing packages

import pandas as pd
from rank_bm25 import BM25Okapi

What is BM25?

BM25 is a ranking function, and the rank_bm25 Python package imported above implements it. We can use it to rank our data, tweets in our case, against a search query. It builds on the concept of TF-IDF, i.e.

TF or Term Frequency: simply put, the number of occurrences of the search term in a tweet.
IDF or Inverse Document Frequency: measures how important the search term is. TF treats all terms as equally important, so term frequencies alone cannot determine the weight of a term in the text. We need to weigh down frequent terms and scale up rare ones, since rare terms say more about a tweet's relevance.

Once you run the query, BM25 returns a relevance score for each tweet with respect to your search term. You can sort the tweets by this score to surface the most relevant ones.
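To make the scoring idea concrete, here is a minimal pure-Python sketch of an Okapi BM25 scorer. It is an illustration only, not the code inside rank_bm25: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the "+1" smoothed IDF variant, and the three toy documents are all assumptions made for this example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the tokenized query (illustrative sketch)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher IDF; the "+1" keeps the value non-negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # TF saturates as it grows, and b penalises longer-than-average documents.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus, made up for this example.
corpus = [
    "vaccine rollout delayed again".split(),
    "oxygen shortage reported in city hospitals".split(),
    "got my vaccine today vaccine centre was quick".split(),
]
scores = bm25_scores(["vaccine"], corpus)
print(scores)
```

The second document never mentions the query term, so its score is zero, while the other two receive positive scores that depend on both term frequency and document length.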

Preparing your tweets

Since this is not a discussion of the Twitter API, we will start from an Excel-based feed. You can clean your text data with the following key steps to make the search more robust.

Tokenization:

Splitting the sentence into words, so that each word can be considered individually.

import warnings
warnings.filterwarnings('ignore')

import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer models once (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

sentence = "Jack is a sharp minded fellow"
words = word_tokenize(sentence)
print(words)

Removing special characters:

Removing the special characters from your tweets.

import re

def spl_chars_removal(lst):
    lst1 = list()
    for element in lst:
        # Replace every character that is not a letter or digit with a space
        cleaned = re.sub("[^0-9a-zA-Z]", " ", element)
        lst1.append(cleaned)
    return lst1

Removing stop words:

Stop words are commonly used words (is, for, the, etc.) in the tweets. These words carry little importance, as they do not help in distinguishing two tweets. I used the Gensim package to remove stop words. You can also try NLTK, but I found Gensim much faster.

You can also easily add new words to the stop words list if your data is dominated by certain frequently occurring words.

from nltk.tokenize import word_tokenize
from gensim.parsing.preprocessing import STOPWORDS

# Add custom words to the pre-defined stop words list
all_stopwords_gensim = STOPWORDS.union(set(['disease']))

def stopwords_removal_gensim_custom(lst):
    lst1 = list()
    for text in lst:
        text_tokens = word_tokenize(text)
        tokens_without_sw = [word for word in text_tokens if word not in all_stopwords_gensim]
        str_t = " ".join(tokens_without_sw)
        lst1.append(str_t)
    return lst1

Normalization:

Text normalization is the process of transforming a text into a canonical (standard) form. For example, the words “gooood” and “gud” can be transformed to “good”, their canonical form. Another example is mapping near-identical words such as “stopwords”, “stop-words” and “stop words” to just “stopwords”.

This technique is important for noisy texts such as social media comments, text messages, and comments on blog posts, where abbreviations, misspellings, and out-of-vocabulary (OOV) words are prevalent. People tend to write comments in shorthand, which makes this pre-processing step very important.

Raw	Normalized
yest, yday	yesterday
tomo, 2moro, 2mrw, tmrw	tomorrow
brb	be right back
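One simple way to apply such a mapping is a dictionary lookup over the tokens. The sketch below is a minimal illustration: the `NORMALIZATION_MAP` dictionary and the `normalize_text` helper are hypothetical names, and the entries are just the examples from the table and text above, not a complete shorthand dictionary.

```python
# Hypothetical lookup table built from the examples above.
NORMALIZATION_MAP = {
    "yest": "yesterday", "yday": "yesterday",
    "tomo": "tomorrow", "2moro": "tomorrow", "2mrw": "tomorrow", "tmrw": "tomorrow",
    "brb": "be right back",
    "gooood": "good", "gud": "good",
}

def normalize_text(text):
    """Replace known shorthand tokens with their canonical form."""
    tokens = text.lower().split()
    # Tokens without an entry in the map are kept unchanged.
    return " ".join(NORMALIZATION_MAP.get(tok, tok) for tok in tokens)

print(normalize_text("brb will get my vaccine 2moro"))
# prints: be right back will get my vaccine tomorrow
```

A lookup like this only catches the spellings you list, so in practice the dictionary has to be grown from the shorthand that actually appears in your tweets.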

Stemming:

Stemming is the process of reducing inflected words (e.g. troubled, troubles) to their root form (e.g. trouble). The “root” in this case may not be a real root word, but just a canonical form of the original word.

Stemming uses a heuristic process that chops off the ends of words in the hope of correctly transforming them into their root form. The results need to be reviewed: in the example below, “Machine” becomes “machin” because the trailing “e” is chopped off (NLTK's PorterStemmer also lowercases its output by default).

from nltk.stem import PorterStemmer

ps = PorterStemmer()
sentence = "Machine Learning is cool"
for word in sentence.split():
    print(ps.stem(word))

Output:
machin
learn
is
cool
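The cleaning steps in this section can be chained into a single helper that turns raw tweets into the cleaned list used for search. The sketch below uses only the standard library and is an assumption-laden illustration: `clean_tweet`, the tiny `STOP_WORDS` set, the lowercasing step, and the sample tweets are all made up for this example, whereas the article itself relies on Gensim's stop word list.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"is", "a", "the", "for", "to", "my", "on"}

def clean_tweet(text):
    """Lowercase, strip special characters, and drop stop words."""
    # Replace anything that is not a letter or digit with a space.
    text = re.sub("[^0-9a-zA-Z]", " ", text.lower())
    # split() with no argument also collapses the repeated spaces left behind.
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

raw_tweets = [
    "Now I will go for my #Covid vaccine!!",
    "The oxygen shortage is on the news",
]
cleaned = [clean_tweet(t) for t in raw_tweets]
print(cleaned)
# prints: ['now i will go covid vaccine', 'oxygen shortage news']
```

Normalization and stemming could be added to the same helper as further steps, depending on how aggressive you want the cleaning to be.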

Tokenizing tweets and running BM25

This is the central piece, where we run the search query. We search the tweets for the word “vaccine”; in practice the query would come from the user. You can enter a phrase too, and it will work just as well because we tokenize the search term in the last line below.

# lst1 is assumed to hold the cleaned tweets from the preparation steps
tokenized_corpus = [doc.split(" ") for doc in lst1]
bm25 = BM25Okapi(tokenized_corpus)
query = "vaccine"  # enter search query
tokenized_query = query.split(" ")

You can check the association of each tweet with your search term using the .get_scores function.

doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores)

Passing n=5 to .get_top_n returns the five tweets most associated with the query. You can set n according to your needs.

docs = bm25.get_top_n(tokenized_query, lst1, n=5)
df_search = df[df['Text'].isin(docs)]
df_search.head()
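Conceptually, picking the top n results amounts to sorting the tweets by their score. The following is a minimal sketch of that idea, not rank_bm25's own implementation: `top_n_by_score` is a hypothetical helper, and the tweets and scores are invented stand-ins for lst1 and doc_scores.

```python
def top_n_by_score(documents, scores, n=5):
    """Return the n documents with the highest scores, best first."""
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:n]]

# Hypothetical tweets and scores, standing in for lst1 and doc_scores.
tweets = ["vaccine rollout update", "oxygen shortage", "got my vaccine vaccine done"]
scores = [0.42, 0.0, 0.57]
print(top_n_by_score(tweets, scores, n=2))
# prints: ['got my vaccine vaccine done', 'vaccine rollout update']
```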

Top Five associated Tweets

Top 5 Tweets	Tweeted By
@MikeCarlton01 Re #ABC funding, looked up Budget Papers. After massive prior cuts, it got extra $4.7M in funding (.00044% far less than inflation).#Morrison wastes $Ms on over-priced & ineffective services eg useless #Covid app.; delivery vaccine #agedcare; consultancies vaccine roll-out..	MORRIGAN
@TonyHWindsor @barriecassidy @4corners @abc730 For its invaluable work, #ABC got extra $4.7M in funding (.00044% far less than inflation).While #Morrison Govt spends like drunken sailor on buying over-priced & ineffective services from mates (eg useless #Covid app.; delivery vaccine #agedcare; vaccine roll-out) #auspol	MORRIGAN
It’s going to be a month after my #Covid recovery. Now I will go vaccine 😎😎😎😎	Simi Elizabeth😃
RT @pradeepkishan : What a despicable politician is #ArvindKejariwal ! The minute oxygen hoarding came to light his propaganda shifted to vaccine shortage. He is more dangerous than #COVID itself! @BJP4India @TajinderBagga	p.hariharan
RT @AlexBerenson : TL: DR – In the @pfizer teen #Covid vaccine trial, 4 or 5 (the exact figure is hidden) of 1,100 kids who got the vaccine had serious side effects, compared to 1 who got placebo.@US_FDA did not disclose specifics, so we have no idea what they were or if they follow any pattern. https://t.co/n5igf2xXFN	Sagezza

Additional use cases of BM25

There are many use cases where a search feature is required. One of the most relevant is parsing PDFs and building a search function over their content.

This is one of the most widely used applications of BM25. As the world slowly shifts to better data strategies and more efficient storage techniques, old PDF documents can be retrieved efficiently using algorithms like BM25.

Hope you enjoyed reading this and find this helpful. Thank you, folks!

Conclusion

In this article we imported the required packages, looked at how BM25 ranks documents by relevance, and prepared the tweets by cleaning the data: tokenization breaks text into words, removing special characters and stop words sharpens the focus, normalization ensures consistency, and stemming reduces words to their root forms. Running BM25 over the cleaned tweets then returned the five tweets most relevant to our query. The same ranking approach is also what search engines rely on.

Frequently Asked Questions

Q1. What is the BM25 method?

BM25 is a ranking algorithm used to score and rank documents based on their relevance to a search query. It combines term frequency (TF), inverse document frequency (IDF), and document length to improve accuracy.

Q2. Why is BM25 better than TF-IDF?

BM25 is better because it handles term frequency saturation (too many repetitions of a term don’t over-influence the score) and accounts for document length, making it more effective for real-world search scenarios.
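Term frequency saturation is easy to see numerically. The sketch below evaluates only the term-frequency part of the BM25 formula with length normalization switched off; the helper name `tf_component` and the value `k1=1.5` are assumptions chosen for illustration.

```python
def tf_component(tf, k1=1.5):
    """Term-frequency part of BM25, ignoring length normalisation (b = 0)."""
    return tf * (k1 + 1) / (tf + k1)

# The contribution grows with tf but never exceeds k1 + 1.
for tf in (1, 2, 5, 20, 100):
    print(tf, round(tf_component(tf), 3))
```

Going from one occurrence to two raises the contribution noticeably, while going from 20 to 100 barely moves it, which is why a tweet that repeats a keyword many times cannot dominate the ranking.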

Q3. What is BM25 in Elasticsearch?

In Elasticsearch, BM25 is the default ranking algorithm used to calculate relevance scores for search results, replacing TF-IDF for better accuracy and performance.
