This article was published as a part of the Data Science Blogathon
This article is part of an ongoing blog series on Natural Language Processing (NLP). In the previous article, we completed our discussion on Topic Modelling Techniques. Now, in this article, we will be discussing an important application of NLP in Information Retrieval.
So, In this article, we will discuss the basic concepts of Information Retrieval along with some of the models that are used in Information Retrieval.
NOTE: In this article, we will discuss only the basics related to Information Retrieval. If you want to learn more about Information Retrieval, you can learn on your own otherwise you can ping me for making a blog series also on Information Retrieval in the Future.
This is part-20 of the blog series on the Step by Step Guide to Natural Language Processing.
1. What are Information Retrieval (IR) Systems?
2. Basics of IR Systems
3. Classical Problem in IR Systems
4. What are IR Models?
5. What are the types of IR Models?
6. What are Boolean Models?
7. Advantages and Disadvantages of Boolean Models
8. What are Vector Space Models?
9. How to Evaluate IR Systems?
Image Source: Google Images
Firstly we will discuss what exactly is Information Retrieval?
Information retrieval is defined as the process of accessing and retrieving the most appropriate information from text based on a particular query given by the user, with the help of context-based indexing or metadata.
Google Search is the most famous example of information retrieval.
Now let’s discuss what are Information Retrieval Systems?
An information retrieval system searches a collection of natural language documents with the goal of retrieving exactly the set of documents that matches a user’s question. They have their origin in library systems.
These systems assist users in finding the information they require but it does not attempt to deduce or generate answers. It tells about the existence and location of documents that might consist of the required information that is given to the user. The documents that satisfy the user’s requirement are called relevant documents. If we have a perfect IR system, then it will retrieve only relevant documents.
Image Source: Google Images
From the above diagram, it is clear that a user who needs information will have to formulate a request in the form of a query in natural language. After that, the IR system will return output by retrieving the relevant output, in the form of documents, about the required information.
The step by step procedure of these systems are as follows:
Retrieval Systems consist of mainly two processes:
It is the process of selecting terms to represent a text.
Indexing involves:
Two common Indexing Techniques:
It is the process of finding a measure of similarity between two text representations.
The relevance of a document is computed based on the following parameters:
1. TF: It stands for Term Frequency which is simply the number of times a given term appears in that document.
TF (i, j) = (count of ith term in jth document)/(total terms in jth document)
2. IDF: It stands for Inverse Document Frequency which is a measure of the general importance of the term.
IDF (i) = (total no. of documents)/(no. of documents containing ith term)
3. TF-IDF Score (i, j) = TF * IDF
The main aim behind IR research is to develop a model for retrieving information from the repositories of documents. Ad-hoc retrieval problem is the classical problem in IR systems.
Now, let’s discuss what exactly is ad-hoc retrieval?
In ad-hoc retrieval, the user must have to enter a query in natural language that describes the required information. Then the IR system will return the output as the required documents that are related to the desired information.
For Example, suppose we are searching for something on the Internet and it gives some exact pages that are relevant as per our requirement but there can be some non-relevant pages too. This is due to the ad-hoc retrieval problem.
The aspects of ad-hoc retrieval that are addressed in IR research are as follows:
Information retrieval models predict and explain what a user will find in relevance to the given query. These are basically a pattern that defines the above-mentioned aspects of retrieval procedure that we discussed in ad-hoc retrieval and consists of the following:
Mathematically, a retrieval model consists of the following components:
The following are three models that are classified for the Information model (IR) model:
These are the simplest and easy-to-implement IR models. These are based on mathematical knowledge that was easily recognized and understood as well.
Following are the examples of classical IR models:
These are completely opposite to the classical IR models. These are based on principles other than similarity, probability, Boolean operations.
Following are the examples of Non-classical IR models:
It is the enhancement of the classical IR model that makes use of some specific techniques from some other fields.
Following are the examples of Alternative IR models:
Boolean Model is the oldest model for Information Retrieval (IR). These models are based on set theory and Boolean algebra, where
As a response to the query, the set of documents that satisfied the Boolean expression are retrieved.
The boolean model can be defined as:
D: It represents a set of words, i.e, the indexing terms present in a document. Here, each term is either present (1) or absent (0) in the document.
Q: It represents a Boolean expression, where terms are the index terms and operators are logical products such as:
F: It represents a Boolean algebra over sets of terms as well as over sets of documents.
If we talk about the relevance feedback, then in the Boolean IR model the Relevance prediction can be defined as follows:
R: A document is predicted as relevant to the query expression if and only if it satisfies the query expression as −
((𝑡𝑒𝑥𝑡 ˅ 𝑖𝑛𝑓𝑜𝑟𝑚𝑎𝑡𝑖𝑜𝑛) ˄ 𝑟𝑒𝑟𝑖𝑒𝑣𝑎𝑙 ˄ ˜ 𝑡ℎ𝑒𝑜𝑟𝑦)
We can explain this model by a query term as an unambiguous definition of a set of documents.
For Example, suppose we have the query term “analytics”, which defines the set of documents that are indexed with the term “analytics”.
Now, think on what is the result after we combining terms with the Boolean ‘AND’ Operator?
After doing the ‘AND’ operation, it will define a document set that is smaller than or equal to the document sets of any of the single terms.
For Example, now we have the query with terms “Vidhya” and “analytics” that will produce the set of documents that are indexed with both the terms. In simple words, the document set with the intersection of both the sets described here.
Now, also think on what is the result after combining terms with the Boolean ‘OR’ operator?
After doing the ‘OR’ operation, it will define a document set that is bigger than or equal to the document sets of any of the single terms.
For Example, now we have the query with terms “Vidhya” or “analytics” that will produce the set of documents that are indexed with either the term “Vidhya” or “analytics”. In simple words, the document set with the union of both sets described here.
Following are the advantages of the Boolean model:
1. It is the simplest model based on sets.
2. It is easy to understand and implement.
3. It only retrieves exact matches.
4. It gives the user, a sense of control over the system.
Following are the disadvantages of the Boolean model:
1. The model’s similarity function is Boolean. Hence, there would be no partial matches. This can be annoying for the users.
2. In this model, the Boolean operator usage has much more influence than a critical word.
3. The query language is expressive, but it is complicated too.
4. There is no ranking for retrieved documents by the model.
As we have seen that there are some limitations in the Boolean model, so we have come up with a new model which is based on Luhn’s similarity criterion, which states that “the more two representations agreed in given elements and their distribution, the higher would be the probability of their representing similar information”.
To understand more about the vector Space model, you have to understand the following points:
1. In this model, the index representations (documents) and the queries are represented by vectors in a T dimensional Euclidean space.
2. T represents the number of distinct terms used in the documents.
3. Each axis corresponds to one term.
4. Ranked list of documents ordered by similarity to the query where the similarity between a query and a document is computed using a metric on the respective vectors.
5. The similarity measure of a document vector to a query vector is usually the cosine of the angle between them.
Image Source: Google Images
The two common effective measures for evaluating IR systems are as follows:
Image Source: Google Images
Precision: Precision is the Proportion of retrieved documents that are relevant.
Recall: The recall is the Proportion of relevant documents that are retrieved.
Ideally both precision and recall should be 1. In practice, these are inversely related.
You can also check my previous blog posts.
Previous Data Science Blog posts.
Here is my Linkedin profile in case you want to connect with me. I’ll be happy to be connected with you.
For any queries, you can mail me on [email protected]
Thanks for reading!
I hope that you have enjoyed the article. If you like it, share it with your friends also. Something not mentioned or want to share your thoughts? Feel free to comment below And I’ll get back to you. 😉
Very informative, really it is much helpful for understanding the basics of deep learning and machine learning.