Part 20: Step by Step Guide to Master NLP – Information Retrieval

Chirag Goyal Last Updated : 12 Nov, 2024

7 min read

This article was published as a part of the Data Science Blogathon

Introduction

This article is part of an ongoing blog series on Natural Language Processing (NLP). In the previous article, we completed our discussion of topic modelling techniques. In this article, we turn to an important application of NLP: Information Retrieval.

We will discuss the basic concepts of Information Retrieval along with some of the models that are used in it.

NOTE: This article covers only the basics of Information Retrieval. If you want to learn more, you can explore the topic on your own, or you can ping me and I may write a separate blog series on Information Retrieval in the future.

This is part-20 of the blog series on the Step by Step Guide to Natural Language Processing.

1. What are Information Retrieval (IR) Systems?

2. Basics of IR Systems

3. Classical Problem in IR Systems

4. What are IR Models?

5. What are the types of IR Models?

6. What are Boolean Models?

7. Advantages and Disadvantages of Boolean Models

8. What are Vector Space Models?

9. How to Evaluate IR Systems?

Information Retrieval Systems

Information Retrieval- A Brief Overview | by Soumya Shukla | Medium

Image Source: Google Images

First, let’s discuss what exactly Information Retrieval is.

Information retrieval is defined as the process of accessing and retrieving the most appropriate information from text, based on a particular query given by the user, with the help of context-based indexing or metadata.

Google Search is the most famous example of information retrieval.

Now let’s discuss what Information Retrieval systems are.

An information retrieval system searches a collection of natural language documents with the goal of retrieving exactly the set of documents that match a user’s question. These systems have their origin in library systems.

These systems assist users in finding the information they require, but they do not attempt to deduce or generate answers. Instead, they report the existence and location of documents that might contain the required information. The documents that satisfy the user’s requirement are called relevant documents. A perfect IR system would retrieve only relevant documents.

Basics of IR Systems

Image Source: Google Images

From the above diagram, it is clear that a user who needs information has to formulate a request in the form of a query in natural language. The IR system then responds by retrieving the documents that are relevant to the required information.

The step-by-step procedure of these systems is as follows:

Indexing the collection of documents.
Transforming the query in the same way as the document content is represented.
Comparing the description of each document with that of the query.
Listing the results in order of relevancy.
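
The four steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `to_terms` and `search`, the sample documents, and the scoring rule (counting shared terms) are assumptions made for this example, not part of any particular IR system.

```python
import re

def to_terms(text):
    """Lowercase the text and split it into a set of alphabetic terms."""
    return set(re.findall(r"[a-z]+", text.lower()))

def search(query, documents):
    """Return (doc_id, score) pairs, best match first."""
    # Step 1: index the collection of documents.
    index = {doc_id: to_terms(text) for doc_id, text in documents.items()}
    # Step 2: transform the query in the same way as the documents.
    query_terms = to_terms(query)
    # Step 3: compare each document description with the query.
    # The score here is simply the number of shared terms (an assumption).
    scores = {doc_id: len(terms & query_terms) for doc_id, terms in index.items()}
    # Step 4: list the matching documents in order of relevance.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked if score > 0]

# Hypothetical sample collection.
documents = {
    "d1": "Information retrieval systems search document collections",
    "d2": "Topic modelling finds themes in text",
    "d3": "Retrieval of information from text documents",
}
results = search("information retrieval", documents)
print(results)
```

Real systems replace the simple overlap count with weighted scores such as TF-IDF, but the overall flow stays the same.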

Retrieval Systems consist of mainly two processes:

Indexing
Matching

Indexing

It is the process of selecting terms to represent a text.

Indexing involves:

Tokenization of the string
Removing frequent words (stop words)
Stemming
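
As a rough sketch, the three indexing steps could look like the following. The stop-word list and the suffix-stripping rule in `crude_stem` are deliberately tiny assumptions for illustration; in practice, you would typically use a proper stemmer (for example, a Porter stemmer) and a fuller stop-word list.

```python
import re

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and", "for"}

def tokenize(text):
    """Split the lowercased string into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop very frequent words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokenize, remove stop words, then stem what is left."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]

terms = index_terms("The systems are indexing the documents")
print(terms)
```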

Two common Indexing Techniques:

Boolean Model
Vector space model

Matching

It is the process of finding a measure of similarity between two text representations.

The relevance of a document is computed based on the following parameters:

1. TF: It stands for Term Frequency, which measures how often a given term appears in a document, normalized by the length of that document.

TF (i, j) = (count of ith term in jth document)/(total terms in jth document)

2. IDF: It stands for Inverse Document Frequency, which is a measure of the general importance of the term across the collection. A term that appears in only a few documents gets a higher IDF. The ratio is usually log-scaled:

IDF (i) = log((total no. of documents)/(no. of documents containing ith term))

3. TF-IDF Score (i, j) = TF (i, j) * IDF (i)
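
To make these quantities concrete, here is a small sketch that computes them by hand on a toy corpus. The corpus and function names are assumptions for illustration, and the sketch uses the commonly used logarithmic form of IDF.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: count of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Log-scaled inverse document frequency (0.0 if the term never occurs)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical, already tokenized documents.
corpus = [
    ["information", "retrieval", "system"],
    ["boolean", "retrieval", "model"],
    ["vector", "space", "model"],
]
score = tf_idf("information", corpus[0], corpus)
common = tf_idf("retrieval", corpus[0], corpus)
print(score, common)
```

Here "information" occurs in only one document, so it scores higher than "retrieval", which occurs in two.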

Classical Problem in IR Systems

The main aim of IR research is to develop models for retrieving information from repositories of documents. The ad-hoc retrieval problem is the classical problem in IR systems.

Now, let’s discuss what exactly ad-hoc retrieval is.

In ad-hoc retrieval, the user enters a query in natural language that describes the required information. The IR system then returns the documents that are related to the desired information.

For example, suppose we search for something on the Internet. The search returns some pages that are exactly relevant to our requirement, but it may return some non-relevant pages too. Separating the relevant pages from the non-relevant ones is the ad-hoc retrieval problem.

Aspects of Ad-hoc Retrieval

The aspects of ad-hoc retrieval that are addressed in IR research are as follows:

How can users improve the original formulation of a query with the help of relevance feedback?
How can database merging be implemented, i.e., how can results from different text databases be merged into one result set?
How should partly corrupted data be handled, and which models are appropriate for it?

Information Retrieval Models

Information retrieval models predict and explain what a user will find relevant to a given query. A model is basically a pattern that defines the aspects of the retrieval procedure discussed above, and it consists of the following:

A model for documents.
A model for queries.
A matching function that compares queries to documents.

Mathematically, a retrieval model consists of the following components:

D: Representation for documents.
Q: Representation for queries.
F: The modeling framework for D and Q, along with the relationship between them.
R (q, di): A ranking or similarity function that orders the documents with respect to the query.

Types of IR Model

Information Retrieval (IR) models are classified into the following three types:

Classical IR Models

These are the simplest and easiest-to-implement IR models. They are based on mathematical knowledge that is easily recognized and understood.

Following are the examples of classical IR models:

Boolean models,
Vector models,
Probabilistic models.

Non-Classical IR Models

These are the opposite of the classical IR models: they are based on principles other than similarity, probability, and Boolean operations.

Following are the examples of Non-classical IR models:

Information logic models,
Situation theory models,
Interaction models.

Alternative IR Models

These models enhance the classical IR models by making use of specific techniques from other fields.

Following are the examples of Alternative IR models:

Cluster models,
Fuzzy models,
Latent Semantic Indexing (LSI) models.

Boolean Model

The Boolean model is the oldest model for Information Retrieval (IR). It is based on set theory and Boolean algebra, where

Documents: Sets of terms
Queries: Boolean expressions on terms

As a response to the query, the set of documents that satisfy the Boolean expression is retrieved.

The boolean model can be defined as:

D: It represents a set of words, i.e., the indexing terms present in a document. Here, each term is either present (1) or absent (0) in the document.

Q: It represents a Boolean expression, where the terms are the index terms and the operators are:

Logical product − AND,
Logical sum − OR,
Logical difference − NOT.

F: It represents a Boolean algebra over sets of terms as well as over sets of documents.

In the Boolean IR model, relevance prediction can be defined as follows:

R: A document is predicted as relevant to the query expression if and only if it satisfies the query expression. For example −

((text ∨ information) ∧ retrieval ∧ ¬theory)
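
As a small illustration, the query expression above can be evaluated directly on documents represented as sets of terms. The sample documents and the function name `is_relevant` are assumptions made for this sketch.

```python
def is_relevant(doc_terms):
    """Evaluate ((text OR information) AND retrieval AND NOT theory)."""
    return (
        ("text" in doc_terms or "information" in doc_terms)
        and "retrieval" in doc_terms
        and "theory" not in doc_terms
    )

# Hypothetical documents, each one a set of index terms.
documents = {
    "d1": {"information", "retrieval", "system"},
    "d2": {"information", "retrieval", "theory"},
    "d3": {"text", "mining"},
    "d4": {"text", "retrieval"},
}
relevant = sorted(doc_id for doc_id, terms in documents.items() if is_relevant(terms))
print(relevant)
```

Note that d2 is rejected because it contains "theory", and d3 is rejected because it lacks "retrieval". A document either satisfies the expression or it does not.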

In this model, each query term can be read as an unambiguous definition of a set of documents.

For example, the query term “analytics” defines the set of documents that are indexed with the term “analytics”.

Now, think about the result of combining terms with the Boolean ‘AND’ operator.

The ‘AND’ operation defines a document set that is smaller than or equal to the document set of any of the single terms.

For example, the query “Vidhya” AND “analytics” produces the set of documents that are indexed with both terms. In simple words, it is the intersection of the two sets.

Now, also think about the result of combining terms with the Boolean ‘OR’ operator.

The ‘OR’ operation defines a document set that is bigger than or equal to the document set of any of the single terms.

For example, the query “Vidhya” OR “analytics” produces the set of documents that are indexed with either the term “Vidhya” or the term “analytics”. In simple words, it is the union of the two sets.
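
The intersection and union described above map directly onto Python set operations over an inverted index (a mapping from each term to the set of documents that contain it). The following is a minimal sketch; the sample documents and the helper `build_inverted_index` are assumptions for illustration.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical sample collection.
documents = {
    1: "analytics vidhya blog",
    2: "data analytics course",
    3: "vidhya community",
}
index = build_inverted_index(documents)

both = index["vidhya"] & index["analytics"]    # AND: intersection
either = index["vidhya"] | index["analytics"]  # OR: union
print(both, either)
```

The AND result can never be larger than either single-term set, and the OR result can never be smaller, which matches the description above.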

Advantages of the Boolean Model

Following are the advantages of the Boolean model:

1. It is the simplest model based on sets.

2. It is easy to understand and implement.

3. It only retrieves exact matches.

4. It gives the user a sense of control over the system.

Disadvantages of the Boolean Model

Following are the disadvantages of the Boolean model:

1. The model’s similarity function is Boolean. Hence, there would be no partial matches. This can be annoying for the users.

2. In this model, the Boolean operator usage has much more influence than a critical word.

3. The query language is expressive, but it is complicated too.

4. The model does not rank the retrieved documents.

Vector Space Model

As we have seen, the Boolean model has some limitations. The vector space model addresses them. It is based on Luhn’s similarity criterion, which states that “the more two representations agreed in given elements and their distribution, the higher would be the probability of their representing similar information”.

To understand the vector space model, keep the following points in mind:

1. In this model, the index representations (documents) and the queries are represented by vectors in a T-dimensional Euclidean space.

2. T represents the number of distinct terms used in the documents.

3. Each axis corresponds to one term.

4. The model returns a ranked list of documents ordered by similarity to the query, where the similarity between a query and a document is computed using a metric on the respective vectors.

5. The similarity measure of a document vector to a query vector is usually the cosine of the angle between them.

https://i.stack.imgur.com/36r1U.png

Image Source: Google Images
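
Here is a minimal sketch of ranking by cosine similarity. It uses raw term counts as vector components to keep the example short; in practice, TF-IDF weights are a common choice. The sample documents, the query, and the function names are assumptions for illustration.

```python
import math
from collections import Counter

def to_vector(tokens, vocabulary):
    """Represent tokens as a vector of term counts, one axis per term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical, already tokenized documents and query.
docs = {
    "d1": "information retrieval system".split(),
    "d2": "boolean retrieval model".split(),
    "d3": "topic model".split(),
}
query = "information retrieval".split()

# T distinct terms in the documents give a T-dimensional space.
vocabulary = sorted({t for tokens in docs.values() for t in tokens})
query_vec = to_vector(query, vocabulary)
ranking = sorted(
    docs,
    key=lambda d: cosine_similarity(to_vector(docs[d], vocabulary), query_vec),
    reverse=True,
)
print(ranking)
```

Unlike the Boolean model, d2 still receives a non-zero score for a partial match, so it is ranked below d1 but above d3.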

Evaluation of IR Systems

The two common effectiveness measures for evaluating IR systems are as follows:

Precision
Recall

Performance Evaluation of Information Retrieval Systems - ppt video online download

Image Source: Google Images

Precision: Precision is the proportion of retrieved documents that are relevant.

Recall: Recall is the proportion of relevant documents that are retrieved.

Ideally, both precision and recall should be 1. In practice, there is usually a trade-off between them: improving one tends to lower the other.
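
Both measures can be computed from the set of retrieved documents and the set of relevant documents. The following sketch uses hypothetical document ids, and the function name `precision_recall` is an assumption for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and ground truth for a single query.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(precision, recall)
```

In this example, two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were retrieved (recall of about 0.67).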

This ends our Part-20 of the Blog Series on Natural Language Processing!

Email

For any queries, you can mail me at [email protected]

End Notes

Thanks for reading!

I hope that you have enjoyed the article. If you liked it, share it with your friends. Is something not mentioned, or do you want to share your thoughts? Feel free to comment below, and I’ll get back to you. 😉

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Chirag Goyal

I am a B.Tech. student (Computer Science major) currently in the pre-final year of my undergrad. My interest lies in the field of Data Science and Machine Learning. I have been pursuing this interest and am eager to work more in these directions. I feel proud to share that I am one of the best students in my class who has a desire to learn many new things in my field.
