Hey Folks!!
I hope you enjoyed my last article, which was all about text generation using LSTM; if you haven't read it yet, you can find it at this link.
In this article, we will cover a very interesting yet tricky use case of NLP: the information retrieval system. Information retrieval is a very widely used application of NLP.
In an information retrieval system, we have a collection of documents and we need to find the document that matches the meaning of a query.
Google has trillions of web pages; how does Google efficiently find the relevant pages for us without being given the page URL?
An IR (information retrieval) system allows us to search for a document efficiently, based on the meaningful information it contains.
As we know, two sentences can have very different structures and different words, yet they can still have the same meaning. In NLP, our goal is to capture the meaning of sentences using various NLP concepts, which we will see in detail later in the article.
In an IR system, we use this contextual meaning to search for documents.
Problem statement: Perform an Information retrieval system using context meaning.
Solution:
There could be multiple ways to perform information retrieval, but the easiest yet very efficient way is to use word embeddings. Word embeddings take the contextual meaning into consideration, regardless of sentence structure.
Applications of an IR system include web search engines (like Google), digital libraries, question answering, and document recommendation.
I have written an article on various feature extraction techniques, including a word embedding implementation; if you haven't read it, the link is here.
Word embeddings are capable of capturing the meaning of a sentence, regardless of its structure.
For example:
"I love him" and "I like him" will have almost the same meaning if we represent them with word embeddings.
It is a predictive feature learning technique where words are mapped to vectors based on the contexts in which they appear.
As you can see, kitten and cat are placed very close together in the embedding space because they have very similar meanings.
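As a quick illustration of this closeness, here is a minimal sketch, assuming the gensim library and its downloadable glove-wiki-gigaword-300 vectors (neither of which is used elsewhere in this article):

# A hedged sketch: gensim and its downloadable GloVe vectors are an assumption,
# not part of this article's pipeline.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-300")   # large download on first run

# related words typically get a clearly higher cosine similarity
print(vectors.similarity("cat", "kitten"))
print(vectors.similarity("cat", "car"))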
Pre-trained word embeddings are learned from corpora of billions of words; we will use a pre-trained embedding (GloVe, trained on 6 billion tokens) for our task.
We are going to implement the information retrieval system using Python.
While implementing the information retrieval system, there are some steps that we need to follow:
We have created our own dataset containing 4 documents; for better understanding, we will use a small dataset.
doc1 = ["""Wasting natural resources is a very serious problem, since we all know that natural resources are limited but still we dont feel it. We use it more than we need, Government are also encouraging people to save the natural resoucres"""]
doc2 = ["""Machine learning is now days a very popular field of study, a continuous reasearch is going on this , Machine learning is all about maths. Analysing the patters solve the task."""]
doc3 = ["""Nowdays books are loosing its charm since the phones and smart gadgets has taken over the old time, Books are now printed in Digital ways, This saves papers and ultimately saves thousands of trees"""]
doc4 = ["""The man behind the wicket is MS DHONI , his fast hand behind wicket are a big advantage for india, but pant has now carrying the legacy of DHONI but he is not that fast"""]
Here we have 4 documents, and our dataset will simply be the list formed by merging them.
#------ merging all the documents -------
data = doc1 + doc2 + doc3 + doc4
print(data)
We import all the dependencies that are going to be used, in one go.
import numpy as np
import nltk
import itertools
import re
import scipy
from scipy import spatial
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
In NLP, data cleaning generally helps the model generalize and usually leads to better results. It's always a good practice to perform data cleaning after loading the data.
Making a function for data cleaning:
def remove_stopwords(text, is_lower_case=False):
    # keep only letters, digits and whitespace
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, " ", ''.join(text))
    tokens = tokenizer.tokenize(text)
    tokens = [tok.strip() for tok in tokens]
    if is_lower_case:
        cleaned_tokens = [tok for tok in tokens if tok not in stopword_list]
    else:
        cleaned_tokens = [tok for tok in tokens if tok.lower() not in stopword_list]
    filtered_text = ' '.join(cleaned_tokens)
    return filtered_text
The function remove_stopwords takes the documents one by one and returns the cleaned document.
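A quick, hedged sanity check of the cleaning function (the exact output depends on the tokenizer and the stopword list):

# Clean the first document: punctuation is stripped and common stopwords are removed
print(remove_stopwords(doc1))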
We will be using pre-trained word vectors of 300 dimensions. You can download the word vectors using this link. If you don't want to download the large word-vector file, creating a notebook on Kaggle is a better option.
Loading the word vectors
The word-vector file is a text file containing the words and their corresponding vectors.
glove_vectors = dict()
file = open('../input/glove6b/glove.6B.300d.txt', encoding='utf-8')
for line in file:
    values = line.split()
    word = values[0]                     # the first token on each line is the word itself
    vectors = np.asarray(values[1:])     # the remaining 300 tokens are its feature vector
    glove_vectors[word] = vectors
file.close()
glove_vectors is a dictionary containing words as keys and their feature vectors as values. glove.6B.300d means this word embedding was trained on 6 billion tokens and each feature vector has a length of 300. Example:
glove_vectors['cat']
The returned feature vector for 'cat' has 300 feature values.
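A quick check of the loaded dictionary (illustrative only; the values are stored as strings, so we cast them to float):

# Each entry is a 300-dimensional vector; cast to float before doing any arithmetic
cat_vector = glove_vectors['cat'].astype(float)
print(cat_vector.shape)   # (300,)
print(cat_vector[:5])     # first five of the 300 feature values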
Creating a feature vector for our documents using word embeddings.
We are creating a function that takes a sentence and returns its feature vector of 300 dimensions.
vec_dimension = 300

def get_embedding(x):
    # start with a zero vector and add up the embeddings of all words in the text
    arr = np.zeros(vec_dimension)
    text = str(x).split()
    for t in text:
        try:
            vec = glove_vectors.get(t).astype(float)
            arr = arr + vec
        except:
            # words that are not in the vocabulary are simply skipped
            pass
    arr = arr.reshape(1, -1)[0]
    # divide by the number of words to get the average vector
    return arr / len(text)
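A short usage sketch (the sentence below is just an illustration):

# Average the word vectors of a short sentence into a single 300-dimensional vector
sentence_vector = get_embedding("machine learning is fun")
print(sentence_vector.shape)   # (300,)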
A document contains many sentences, and a sentence contains as many word vectors as it has words.
To sum up the meaning of a document, we need to average the meaning of all the sentences inside that document.
In the language of NLP, we average the feature vectors of all the sentences inside that document.
out_dict = {}
for sen in data:
    # clean the document, embed every word, and average the word vectors
    average_vector = np.mean(
        np.array([get_embedding(x) for x in nltk.word_tokenize(remove_stopwords(sen))]),
        axis=0)
    out_dict[sen] = average_vector
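After this loop, out_dict maps each document's text to its averaged feature vector; a quick check (illustrative):

# One entry per document, each value a 300-dimensional averaged vector
print(len(out_dict))                      # 4
print(list(out_dict.values())[0].shape)   # (300,)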
As of now, we have the feature vector of each and every document and we need to build a function that can compare the feature vectors based on their meaning.
The similarity of two texts can be calculated using the cosine distance between their feature vectors.
def get_sim(query_embedding, average_vector_doc):
    # cosine similarity = 1 - cosine distance
    sim = [(1 - scipy.spatial.distance.cosine(query_embedding, average_vector_doc))]
    return sim
The function get_sim takes the feature vector of the query and the feature vector of a document and returns the similarity between them. The closer sim is to 1, the more similar the meanings; the closer it is to 0, the more different they are. scipy provides the function scipy.spatial.distance.cosine for calculating the cosine distance. We now have the feature vectors of the documents, and we have created the function to compare the similarity of two feature vectors.
def Ranked_documents(query):
    # embed the query in the same way as the documents
    query_word_vectors = np.mean(
        np.array([get_embedding(x) for x in nltk.word_tokenize(query.lower())], dtype=float),
        axis=0)
    rank = []
    for k, v in out_dict.items():
        # compare the query vector with every document vector
        rank.append((k, get_sim(query_word_vectors, v)))
    # sort the documents by similarity, highest first
    rank = sorted(rank, key=lambda t: t[1], reverse=True)
    print('Ranked Documents :')
    return rank[0]
query_word_vectors holds the feature vector of the input query. Ranked_documents compares the feature vector of the query with the feature vector of each document iteratively. When we call the Ranked_documents function with a query, it returns the single most relevant document for that query.
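The return value is a (document_text, [similarity]) tuple for the best match; the query below is only an illustration, and the exact score depends on the documents:

# Hypothetical query: unpack the top-ranked document and its similarity score
best_doc, score = Ranked_documents('saving trees')
print(score)      # a one-element list holding the cosine similarity
print(best_doc)   # the full text of the top-ranked document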
Example1:
Ranked_documents('bat and ball')
As you can see, "bat and ball" is related to cricket; when we searched this keyword, the model returned the document about cricket, even though the words bat and ball are not present in that document.
Example2:
Ranked_documents('computer science')
As we all know, computer science is closely related to machine learning. When we queried the keyword "computer science", the function returned the machine learning document.
In this article, we discussed the document retrieval system, which is based on comparing the meaning of documents and sentences through their feature vectors.
We implemented the document retrieval system using Python and pre-trained word embeddings. There are multiple ways to perform document matching, but with word embeddings and scipy's cosine distance it becomes much easier to implement.
I hope you liked my article on search engines using deep learning.
You can copy and download the code used in this article using this link.
Feel free to reach out to me on LinkedIn if you have any queries or suggestions.