“Google it!” Isn’t that something we say every day?
Whenever we come across something we don’t know about, we “Google it.” Google Search is a great tool that can find even a needle in a haystack. This generation absolutely relies on Google for answers to all kinds of problems. From personal complications to business solutions, Google guides you to the remedy.
But there is a catch. Let’s say you wish to search through your company’s documents, or suppose you have an e-commerce website where you want users to search for the products they need. What will you do? Will you use Google Search for that too? Absolutely not! First, we generally don’t want another company involved in our systems, and second, we cannot put our private data on the internet for public access. Then what’s the solution?
In these scenarios, you’ll need to create something of your own, and that’s where information retrieval for text data comes into play. In this article, we’ll learn about information retrieval and create a project in which we’ll perform information retrieval using a word2vec-based vector space model. So, let’s start by understanding what information retrieval is.
Note: if you want to learn more about analyzing text data, refer to this NLP Master’s Program.
“Information Retrieval is the process of finding desired documents from a collection of documents.”
The way this works is that the user expresses their need in the form of text (a query) to the information retrieval system. The system then processes this query and finds the relevant documents in the existing collection of documents (the corpus). These relevant documents are returned to the user in decreasing order of relevance. In this whole process, the order in which documents are returned determines how good or bad our results are.
For example, suppose you go to an e-commerce website and search for an iPhone, and the website shows you an iPhone charger and back cover first, and only then the smartphone. Now, are the charger and back cover related to the query? Yes, somewhat. But are they the most relevant results? No, the smartphone is most relevant to the query, so it should be shown to the user first.
You see, in information retrieval problems, just returning relevant documents is not the task. Instead, you have to return the most relevant ones first, followed by less relevant documents. This is known as document ranking. There are multiple ways of ranking documents for a query, but in this article, we’ll only use the vector space model, which is an unsupervised method.
Since machines cannot understand text, we need numbers to represent queries and documents. And by numbers, I mean vectors: yes, the same vectors we read about in mathematics. There are multiple ways of generating vectors to represent documents and queries, such as Bag of Words (BoW), Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), and others.
Here, I’ll use word2vec. As the name suggests, word2vec means “word to vector,” and that’s exactly what it does: it converts words into vectors. One interesting thing about word2vec is that it can capture context and represent it in the vectors. Due to this, it is able to preserve the semantic and syntactic relationships between words.
You can read more about word2vec here: An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec.
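If you’d like to see this in action before we build anything, here’s a quick, optional sketch that loads one of gensim’s downloadable pre-trained word2vec models. The model name below refers to gensim’s downloadable Google News vectors; note the download is large (about 1.6 GB).

import gensim.downloader as api

# Download (once) and load pre-trained word2vec vectors
w2v = api.load('word2vec-google-news-300')

# Words used in similar contexts end up with similar vectors
print(w2v.most_similar('king', topn=3))
print(w2v.similarity('king', 'queen'))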
Now we know that we’ll be using word2vec for creating vectors, but we still don’t know what a vector space model is. So let’s understand it.
“The Vector Space Model works on the concept of similarity. It assumes that the relevance of a document to a query is directly related to the similarity between their vector representations.”
It means that a document (D1) with a higher similarity score to the query (Q) is considered more relevant than a document (D2) with a lower similarity score. This similarity score between the document and query vectors is known as the cosine similarity score and is given by

$$\text{sim}(D, Q) = \cos\theta = \frac{D \cdot Q}{\|D\| \, \|Q\|}$$

where D and Q are the document and query vectors, respectively.
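To make this concrete, here is a minimal sketch of the cosine similarity computation using NumPy:

import numpy as np

def cosine_sim(d, q):
    # Dot product of the vectors divided by the product of their magnitudes
    return np.dot(d, q) / (np.linalg.norm(d) * np.linalg.norm(q))

D = np.array([1.0, 2.0, 3.0])
Q = np.array([2.0, 4.0, 5.0])
print(cosine_sim(D, Q))  # values close to 1 indicate high similarity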
Now that we know about the vector space model, let us take another look at the diagram of the information retrieval system using word2vec.
You can see that I have expanded the processes involved in an information retrieval system that uses a word2vec-based vector space model. Now that you’re familiar with these processes, it’s time to start our project.
In the next section, we’ll create a project using our current understanding. After doing this project, you’ll be able to perform information retrieval using a word2vec-based vector space model on your own.
Whenever we think of any data science project, data is the first thing that crosses our minds. In this section, we’ll talk about the dataset that we’ll be using in this project. But first I’d like to talk about TREC.
TREC stands for the Text REtrieval Conference. Started in 1992, it is a series of workshops that focus on supporting research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. The workshops are organized every year and are centered around a set of tracks, which encourage new research in the area of information retrieval.
But why does this information matter here? It matters because TREC has a vast collection of datasets for information retrieval, and I, too, have selected one for this project.
I have selected the document ranking dataset from the TREC 2019 Deep Learning Track. The dataset contains 367k queries and a corpus of 3.2 million documents. This dataset is appropriate for our needs, but it can also be used with other information retrieval techniques, because the Deep Learning Track was explicitly created for scenarios where a large training set is available. You can read more about this dataset and how it was created over here.
We already know the dataset is quite large: it has about 367k queries and 3.2 million documents. Moreover, there are other files too, and working with such a large dataset requires a lot of resources, especially RAM and time, which are things we don’t have much of.
One possible solution is to reduce the size of the dataset. Since we are doing this for learning purposes, it is fine to reduce the dataset to fit our needs. This way, we’ll be able to focus on the concepts rather than on scaling our IR system, because trust me, that’s a challenge of its own.
Still, to reduce the dataset, we’ll have to load it somewhere first. And the size of the corpus alone is about 22 GB, which will not fit in memory on regular systems. There are multiple ways to overcome this problem; one of them is Dask, a wonderful open-source Python library for parallel computing. It will help us divide and load the dataset in chunks and keep memory consumption to a minimum.
Another thing I personally like about Dask is that it uses existing Python APIs, so you don’t have to rewrite your code to scale up, which makes it extremely beginner-friendly. If you are now excited to learn Dask, then go and read this article:
Let’s fire up our editor and start writing some code. First, we’ll download the files and import the required libraries. For this project, we only require three files: “msmarco-docs.tsv”, “msmarco-doctrain-queries.tsv”, and “msmarco-doctrain-top100”. The first file contains the documents, the second contains the queries, and the last contains the top 100 documents ranked for each query.
Now that we have downloaded the files, let’s load the one containing the queries.
import pandas as pd

# Load the query file; it is tab-separated and has no header row
queries_train = pd.read_table('msmarco-doctrain-queries.tsv', header=None)
queries_train.columns = ['qid', 'query']

print('Shape=>', queries_train.shape)
print(queries_train.head())
You can see that this dataset contains two columns: one for storing the query IDs and another for storing the queries themselves. Now we’ll reduce the number of queries to 2000 by randomly taking a sample of size 2000.
We’ll now create two sets out of this, each of 1000 queries. One will be the training set, and the other will be the testing set. You might be thinking: earlier I said the vector space model is an unsupervised method, so why am I creating a training set now? You’re right, the vector space model is an unsupervised method. It doesn’t require a training set.
Here, I am creating a training set for training the word2vec model. If you want to skip this part, pre-trained models are also available, and you can use one of them instead.
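A minimal sketch of this sampling and splitting step (the random_state value is my own choice, not something the dataset prescribes):

# Randomly sample 2000 queries, then split them into two sets of 1000
queries_sample = queries_train.sample(n=2000, random_state=42)
train_queries = queries_sample.iloc[:1000]
test_queries = queries_sample.iloc[1000:]
print('Train=>', train_queries.shape, 'Test=>', test_queries.shape)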
Now that we have reduced the queries, it’s time to reduce the ranked documents. We’ll first load the data containing the top 100 documents for each query.
This dataset has more than 36 million rows. It contains the top 100 documents for each query, ranked according to their relevance to the query: the document at rank 1 is the most relevant, and the document at rank 100 is the least relevant among the 100. Since we have already reduced the number of queries, let’s reduce this one too.
We have reduced our data from 36 million rows to two sets of 100k rows each. Still, that is quite big for our use, so we’ll now reduce it further. For this, I will label the documents at ranks 1 to 10 as relevant (1) and those at ranks 91 to 100 as non-relevant (0). Doing this benefits us in two ways: first, it reduces the dataset, and second, it acts as the ground truth on which we’ll evaluate our method later.
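Here’s a sketch of how the loading, filtering, and labeling could look, assuming the file follows the standard TREC run format (qid, Q0, docid, rank, score, run name) and reusing the dataframes from the earlier sketches:

# Load the top-100 rankings; the file is space-separated in TREC run format
top100 = pd.read_csv('msmarco-doctrain-top100', sep=' ', header=None,
                     names=['qid', 'Q0', 'docid', 'rank', 'score', 'run'])

# Keep the rankings only for our sampled train and test queries
train_results = top100[top100['qid'].isin(train_queries['qid'])]
test_results = top100[top100['qid'].isin(test_queries['qid'])]

# Label ranks 1-10 as relevant (1) and ranks 91-100 as non-relevant (0);
# everything in between is dropped
def label_results(df):
    df = df[(df['rank'] <= 10) | (df['rank'] >= 91)].copy()
    df['label'] = (df['rank'] <= 10).astype(int)
    return df

train_results = label_results(train_results)
test_results = label_results(test_results)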
Till now, we have reduced our queries and created the ground truth, but we still have to reduce the biggest file, i.e., the corpus. For handling this, I will use Dask dataframes. I will also create separate corpora for the train and test sets by combining the documents present in their respective ground truths. So, let’s load the original corpus and create the corpora for the train and test sets.
Here, I create partitions of 100 MB to break the whole dataset into chunks. Then I map the docid values in the result files to the docid values in the corpus file. For this, I have created a function create_corpus().
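Below is a sketch of this step (the corpus column names and the exact shape of create_corpus() are my assumptions; adjust them to the actual file layout):

import dask.dataframe as dd

# Load the ~22 GB corpus lazily, in roughly 100 MB partitions
corpus = dd.read_csv('msmarco-docs.tsv', sep='\t', header=None,
                     names=['docid', 'url', 'title', 'body'],
                     blocksize='100MB', dtype=str)

def create_corpus(results):
    # Keep only the documents whose docid appears in the ground truth,
    # then pull the filtered partitions into an ordinary pandas dataframe
    doc_ids = list(results['docid'].unique())
    return corpus[corpus['docid'].isin(doc_ids)].compute()

train_corpus = create_corpus(train_results)
test_corpus = create_corpus(test_results)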
Now, we have our datasets ready for further processes. Let’s move to the next section of this article.
One best practice to follow is to read your dataset. If it is tabular, take a look at some rows. If it contains images, look at them. And if it is text, then sit down, take a few samples, and read them all. Just dedicate some time to reading the dataset; it helps you get familiar with the data and find patterns in it, especially in text and image data. Therefore, we’ll now take a sample from the corpus and look at the data we have.
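For instance, we can print one randomly chosen document from our reduced corpus (the column name comes from the earlier sketches):

# Print the body of one randomly chosen document
print(train_corpus.sample(n=1)['body'].values[0])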
Due to constraints of time and length, I have randomly picked only one of the documents, but you should read a few of them. Now, let’s take a look at some queries.
The first thing we notice in the documents is that many words are capitalized, so we’ll have to convert everything to lowercase to bring it down to the same level. Next, we see that the data contains English contractions, so we’ll need to expand those. If we look further at the document, we’ll find unnecessary things like symbols, URLs, digits, punctuation, words containing digits, etc. Therefore, we’ll remove them and keep only the words.
Apart from these steps, you can also see that the documents are quite noisy, i.e., they contain stopwords and multiple forms of the same word. To handle this, we’ll remove the stopwords from the documents and lemmatize them. Keep in mind that we’ll neither remove stopwords from the queries nor lemmatize them, because queries are already short, and removing these words would change their intent.
In the previous section, we looked at the data and figured out the required pre-processing tasks. In this section, we’ll execute those tasks. But first, let’s list the steps we’ll be performing. The pre-processing steps for the documents and queries are as follows:

- Convert the text to lowercase
- Expand the English contractions
- Remove words containing digits, URLs, symbols, punctuation, and anything else that isn’t an English word
- Reduce multiple consecutive spaces to a single space
- Remove stopwords and lemmatize (documents only)
We now have everything clear in our minds, so let’s start writing code to pre-process the documents.
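A sketch of the first two steps, assuming the third-party contractions package (pip install contractions) for expanding contractions; the same steps apply to the test corpus and the queries:

import contractions

# Convert the document text to lowercase
train_corpus['body'] = train_corpus['body'].apply(lambda x: x.lower())

# Expand English contractions, e.g., "don't" -> "do not"
train_corpus['body'] = train_corpus['body'].apply(lambda x: contractions.fix(x))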
If you noticed, I have used lambda functions above to apply a function to every row of the data. If you aren’t familiar with them, you can read my article, which explains lambda functions in a precise manner. I highly recommend reading it because I’ll be using lambda functions a lot throughout the rest of the article.
Till now, we have converted our text to lowercase and expanded the English contractions. Next, we’ll clean the documents using regular expressions. If you have ever used regular expressions before, you might already be familiar with their power. If you don’t know about RegEx, then here are a few articles you can read:
For cleaning the documents, I have created a function clean_text(), which removes words containing digits, replaces newline characters with spaces, removes URLs, and replaces everything that isn’t an English alphabet character with a space.
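A sketch of clean_text() with regular expressions (the exact patterns are my reconstruction of the steps described above):

import re

def clean_text(text):
    text = re.sub(r'\w*\d\w*', '', text)               # remove words containing digits
    text = re.sub(r'\n', ' ', text)                    # replace newlines with spaces
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'[^a-z\s]', ' ', text)              # keep only English letters
    return text

train_corpus['body'] = train_corpus['body'].apply(lambda x: clean_text(x))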
Previously, we replaced many things with blank spaces, which creates extra spaces between words. Therefore, we’ll now reduce every run of spaces to a single one.
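One line with a regular expression handles this:

# Collapse every run of whitespace into a single space
train_corpus['body'] = train_corpus['body'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())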
We’re now done with the text cleaning part. Let’s now remove the stopwords from the documents and lemmatize them. For this, we’ll be using spaCy, a library for advanced Natural Language Processing in Python and Cython. You can read more about stopword removal and lemmatization in this article:
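A sketch of this step with spaCy, assuming the small English model en_core_web_sm (remember, this is applied to the documents only):

import spacy

# Load the English pipeline; the parser and NER components aren't needed here
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def remove_stopwords_and_lemmatize(text):
    # Keep the lemma of every token that is not a stopword
    return ' '.join(token.lemma_ for token in nlp(text) if not token.is_stop)

train_corpus['body'] = train_corpus['body'].apply(remove_stopwords_and_lemmatize)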
We have now pre-processed our documents. It’s time to pre-process our queries.
We have successfully pre-processed both documents and queries. Now, it’s time to create vectors.
In this section, we’ll train our word2vec model and generate vectors for the documents and queries in the testing set. But before that, we’ll prepare the dataset for training the word2vec model. Note that we have already created the training set, but we want to use the same word2vec model to generate vectors for both documents and queries. That’s why we’ll combine the documents and queries into a single file.
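A sketch of this preparation step, combining the pre-processed training documents and training queries into one tokenized list:

# word2vec expects a list of tokenized texts, so we split each text on spaces
train_data = list(train_corpus['body']) + list(train_queries['query'])
train_data = [text.split() for text in train_data]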
Now we’ll train our word2vec model, and for that, we’ll use gensim. Gensim is a Python package used for topic modeling, text processing, and working with word vector models such as Word2Vec and FastText. You can read more about working with word2vec in gensim here.
Here, I train a skip-gram word2vec model, which tries to predict the context given a word. It generates vectors of size 300. Let’s also take a look at the size of its vocabulary.
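A sketch of the training step with gensim (this uses the gensim 4 API; the window, min_count, workers, and epochs values are my own choices):

from gensim.models import Word2Vec

# sg=1 selects the skip-gram architecture; vector_size=300 as described above
w2v_model = Word2Vec(sentences=train_data, vector_size=300, sg=1,
                     window=5, min_count=2, workers=4, epochs=5)

print('Vocabulary size:', len(w2v_model.wv.key_to_index))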
Since word2vec provides vectors for individual words, we’ll create a function get_embedding_w2v() to generate a vector for a whole document or query. This function will use the word2vec model to generate a vector for each word in the document.
Then, it will take the average of those vectors, and the resulting vector will represent the document. Any document or query of length zero will get a vector of zeroes, and any word not present in the vocabulary will get a vector with random values.
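A sketch of get_embedding_w2v() that follows this description:

import numpy as np

def get_embedding_w2v(doc_tokens):
    # Empty documents or queries get a zero vector
    if len(doc_tokens) < 1:
        return np.zeros(300)
    vectors = []
    for token in doc_tokens:
        if token in w2v_model.wv.key_to_index:
            vectors.append(w2v_model.wv[token])   # in-vocabulary word
        else:
            vectors.append(np.random.rand(300))   # out-of-vocabulary word
    # The document vector is the average of its word vectors
    return np.mean(vectors, axis=0)

# Generate vectors for the testing-set documents and queries
test_corpus['vector'] = test_corpus['body'].apply(lambda x: get_embedding_w2v(x.split()))
test_queries['vector'] = test_queries['query'].apply(lambda x: get_embedding_w2v(x.split()))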
We have successfully trained our word2vec model and created vectors for the documents and queries in the testing set. Now, it’s time to rank the documents for each query. But before that, we need to learn about the evaluation metric, because merely implementing a method won’t serve our purpose; we need a metric to check the performance of our model.
The evaluation metric we’ll be using here is Mean Average Precision (MAP@K). Many of you might not have heard of this metric, because the evaluation metrics used in information retrieval differ from traditional ones.
So, let’s first understand what precision is. Suppose we have a model that labels some samples as positive and others as negative. Precision tells us how effective our model is at labeling samples positive. Mathematically, it is given by

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
Here, true positives are the positive samples that the model labeled as positive, and false positives are the actually negative samples that the model labeled as positive; in other words, false positives are the samples the model mislabeled. Precision can therefore also be seen as the percentage of true positives among all the samples labeled as positive.
I am sure that you have read about it before in classification problems. But, there is a twist in the way it is used in information retrieval. Let’s understand it with an example.
Suppose our information retrieval model ranks the documents according to their relevance and returns the top 5 documents. According to our model, all five returned documents are relevant to the query, but when we check the ground truth, we find that the documents at ranks 2 and 4 are non-relevant. Now, let’s try to measure this using precision.
In information retrieval, the meaning of precision remains the same, but the way we draw results from it changes. Here, we calculate precision at a specific rank, denoted by P@K, where K is the rank at which precision is calculated.
Let’s calculate P@K for the above example. At rank 1, we have a relevant document, so our precision (P@1) is 1 because there are no false positives yet. If we had a non-relevant document at this position, then P@1 would have been 0.
Now, let’s move forward and calculate the precision at rank 2 (P@2). Here, we consider both the documents at rank 1 and rank 2. Since one is relevant and the other is non-relevant, P@2 is 0.5. If we look at rank 3, we find 2 relevant and 1 non-relevant documents up to that rank, i.e., 2 true positives and 1 false positive, so P@3 works out to 2/3 ≈ 0.67.
Similarly, you can calculate P@4 and P@5, which come out to 0.5 and 0.6, respectively. I am sure that by now you have understood how we calculate P@K. Now that we have covered precision, let’s understand what average precision is.
Average precision is the average of the precision values, but while calculating it, we only take the precision at those ranks where there is a relevant document. Let’s understand this by calculating the average precision (AP@K) for the previous example. Here, we have non-relevant documents at ranks 2 and 4, so we skip their precision values. The average precision for this query is therefore

$$AP@5 = \frac{P@1 + P@3 + P@5}{3} = \frac{1 + 0.67 + 0.6}{3} \approx 0.76$$
Keep in mind that average precision is calculated per query. Since we’re evaluating the model on the top five documents returned for a query, the average precision can also be denoted as AP@5. Also, if no relevant documents are returned for a query, its average precision becomes zero. Now we know what precision and average precision are; let’s understand what Mean Average Precision (MAP@K) is.
Mean Average Precision is the mean of the average precision over all the queries used during evaluation. So, precision is calculated at each rank, average precision is calculated for a query, and mean average precision is calculated for the whole IR model.
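Written out, for a set of evaluation queries Q, the metric is

$$MAP@K = \frac{1}{|Q|} \sum_{q \in Q} AP@K(q)$$

Now you know what MAP@K is. So, let’s write some code and calculate it for our vector space model.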
For the ranking and evaluation, I have created a function average_precision(), which takes the query ID and the query’s vector as inputs and returns the average precision value.
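Here is a sketch of this ranking-and-evaluation step, reusing the dataframes and get_embedding_w2v() from the earlier sketches:

def cosine_with_query(vec, query_vector):
    # Cosine similarity between a document vector and the query vector
    return np.dot(vec, query_vector) / (np.linalg.norm(vec) * np.linalg.norm(query_vector))

def average_precision(qid, query_vector):
    # Ground-truth labels for this query (1 = relevant, 0 = non-relevant)
    truth = test_results[test_results['qid'] == qid]
    labels = truth.set_index('docid')['label']

    # Rank the query's candidate documents by cosine similarity
    docs = test_corpus[test_corpus['docid'].isin(truth['docid'])].copy()
    docs['score'] = docs['vector'].apply(lambda v: cosine_with_query(v, query_vector))
    docs = docs.sort_values('score', ascending=False)

    # Average the precision values at the ranks of the relevant documents
    hits, precisions = 0, []
    for i, docid in enumerate(docs['docid'], start=1):
        if labels.get(docid, 0) == 1:
            hits += 1
            precisions.append(hits / i)
    return np.mean(precisions) if precisions else 0.0

# MAP is the mean of the average precision over all testing queries
map_value = np.mean([average_precision(row['qid'], row['vector'])
                     for _, row in test_queries.iterrows()])
print('MAP =>', map_value)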
The value of MAP ranges between 0 and 1, with zero being the worst and one the best. Our information retrieval model performs well in the evaluation, with a value of 0.798. We used MAP for evaluation because we had binary labels, but datasets come with different kinds of relevance labels, and based on that, different evaluation metrics are used.
To know more about the different evaluation metrics for information retrieval, you can watch the following video lecture by Prof. Dr. Hannah Bast at the University of Freiburg.
Through evaluation, we know that our method works. But this is not enough: so far, we don’t have any way of accepting new queries and ranking the documents for them.
In this section, we’ll create a function ranking_ir() that takes a query as input and returns the top 10 relevant documents. This function follows the information retrieval (IR) pipeline: first, it pre-processes the query; then, it generates a vector for it; after that, it ranks the documents based on their similarity scores.
Since we have already created vectors for the documents in the corpus, we won’t do that again. In the real world, creating vectors for documents consumes a lot of time and resources, so it is not performed again and again; only the vectors for queries are generated on the fly. Advanced IR systems don’t even do that: they store the vectors of the most frequent queries and reuse them. That’s one of the reasons why you get extremely fast results on Google.
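A sketch of ranking_ir(), stitching together the pre-processing and embedding helpers from the earlier sketches:

def ranking_ir(query):
    # Pre-process the query: lowercase, expand contractions, clean, trim spaces
    query = contractions.fix(query.lower())
    query = re.sub(r'\s+', ' ', clean_text(query)).strip()

    # Generate the query vector on the fly
    query_vector = get_embedding_w2v(query.split())

    # Score every document against the query and return the ten most similar
    docs = test_corpus.copy()
    docs['similarity'] = docs['vector'].apply(lambda v: cosine_with_query(v, query_vector))
    return docs.sort_values('similarity', ascending=False).head(10)[['docid', 'title', 'similarity']]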
We have now created our function. Let’s run some queries on our system.
You can see that for the first query, the system puts the documents talking about Michael Jordan’s net worth at the top, because these documents also contain other biographical information. For the second query, the system puts the documents talking about NBA championships at the top. This shows that our model works well even on queries that were not part of the testing queries.
This is how Information Retrieval using word2vec based Vector Space Model works.
Cheers! We have completed our project. Now you know how information retrieval works using word2vec. So what are you waiting for? Pick an IR dataset, create a project on it, share it on social media, and tag Analytics Vidhya. You can also share your projects in the comments below.
Till now, you have used word2vec for information retrieval and ranking, but there are other methods you can check out too, especially BM25 (Best Match 25). It is quite famous and easy as well, and like the vector space model, it is an unsupervised method used for information retrieval.
Apart from these, there are supervised methods too, which are very powerful. They are referred to as Learning-to-Rank methods. Make sure to read about them, because they are generally combined with unsupervised methods: the unsupervised method produces a quick initial ranking, and then the supervised method re-ranks the top documents.
I have also listed some resources on information retrieval, which you can use to learn more.
Apart from IR, if you are interested in other NLP tasks, then you can read the following articles: