How to Compute and Store Vector Embeddings with LangChain?

Santosh 05 Aug, 2024
8 min read

Introduction

In our previous articles, we discussed loading different types of data and different ways of splitting that data. The data is split so that the content relevant to a query can be found within it. Now, we’re leaping into the future of data retrieval. This article will explore the technique of using vector embeddings with LangChain to efficiently find content that closely matches your query. Join us as we uncover how this powerful approach transforms data handling, making searches faster, more accurate, and more intuitive.


Overview

  • Learn the fundamentals of text embeddings, including representing words and sentences as numerical vectors to capture semantic meanings.
  • Gain practical experience using LangChain’s OpenAI and Hugging Face embedding models to compute and compare sentence embeddings.
  • Explore how to efficiently store and retrieve relevant documents using vector databases built on Approximate Nearest Neighbor algorithms.
  • Understand LangChain’s indexing modes and learn to effectively manage document updates and deletions to maintain an optimal vector database.

Sentence Embeddings

Before using embedding models from LangChain, let’s briefly review what embeddings are in the context of text. 

To perform any computation with text, we need to convert it into numerical form. Since all words are inherently related to each other, we can represent them as vectors of numbers that capture their semantic meanings. For example, the distance between the vectors of two synonyms is smaller than the distance between the vectors of two antonyms. This is typically done using models like BERT.

Since the number of possible sentences is vastly higher than the number of words, we can’t calculate sentence embeddings the way we calculate word embeddings. Sentence embeddings are instead calculated using SentenceBERT models, which use a Siamese network architecture. For more details, read Sentence Embedding.
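As a quick illustration outside LangChain, here is a minimal sketch using the sentence-transformers package (the model name below is just one commonly used SentenceBERT-style choice):

from sentence_transformers import SentenceTransformer, util
# load a small SentenceBERT-style model and embed a few sentences
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["I love dogs", "I adore puppies", "The stock market fell today"])
# semantically similar sentences get a higher cosine similarity
print(util.cos_sim(vectors[0], vectors[1]))  # higher: near-synonymous sentences
print(util.cos_sim(vectors[0], vectors[2]))  # lower: unrelated sentences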

Let’s Create LangChain Documents

Requirements

Install langchain_openai, langchain-huggingface, and langchain-chroma packages using pip in addition to langchain and langchain_community libraries. Make sure to add the OpenAI API key to use OpenAI embedding models.

pip install langchain_openai langchain-huggingface langchain-chroma langchain langchain_community

Example: Creating LangChain Documents

I am using a few example sentences and a query sentence to explain the topics in this article. Let us also create LangChain documents from these sentences, with their categories as metadata.

from langchain_core.documents import Document
sentences = [
    "The Milky Way galaxy contains billions of stars, each potentially hosting its own planetary system.",
    "Photosynthesis is a process used by plants to convert light energy into chemical energy.",
    "The principles of supply and demand are fundamental to understanding market economies.",
    "In calculus, the derivative represents the rate of change of a function with respect to a variable.",
    "Quantum mechanics explores the behavior of particles at the smallest scales, where classical physics no longer applies.",
    "Enzymes are biological catalysts that speed up chemical reactions in living organisms.",
    "Game theory is a mathematical framework used for analyzing strategic interactions between rational decision-makers.",
    "The double helix structure of DNA was discovered by Watson and Crick in 1953, revolutionizing biology."
]
categories = ["Astronomy", "Biology", "Economics", "Mathematics", "Physics", "Biochemistry", "Mathematics", "Biology"]
query = 'Plants use sunlight to create energy through a process called photosynthesis.'
documents = []
for i, sentence in enumerate(sentences):
    documents.append(Document(page_content=sentence, metadata={'source': categories[i]}))
# creating Documents with metadata, where 'source' is the category
# each document looks like: Document(metadata={'source': ...}, page_content='...')

Embeddings with LangChain

Let us initialize the embedding model and embed the sentences.

import os
from dotenv import load_dotenv
load_dotenv('api_keys.env')  # load environment variables, including the OpenAI API key

from langchain_openai import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings(model='text-embedding-3-small', show_progress_bar=True)
embeddings = embedding_model.embed_documents(sentences)
# check the total number of embeddings and embedding dimension
len(embeddings)
>>> 8
len(embeddings[0])
>>> 1536

Now, let’s calculate the cosine similarity of sentences with each other and plot them as heat maps.

import numpy as np
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(embeddings)
sns.heatmap(similarities, cmap='magma', center=None, annot=True, fmt='.2f', yticklabels=sentences)
[Heatmap of pairwise cosine similarities between the example sentences]

As we can see, sentences that belong to the same category and are similar to each other have higher similarity scores with each other than with the rest.

Let’s compute the cosine similarities of these sentences with respect to the query sentence to find the one most similar to it.

query_embedding = embedding_model.embed_query(query)
query_similarities = cosine_similarity(X=[query_embedding], Y=embeddings)
# arrange the sentences in descending order of their similarity with the query sentence
for i in np.argsort(query_similarities[0])[::-1]:
    print(format(query_similarities[0][i], '.3f'), sentences[i])
"0.711 Photosynthesis is a process used by plants to convert light energy into chemical energy.
0.206 Enzymes are biological catalysts that speed up chemical reactions in living organisms.
0.172 The Milky Way galaxy contains billions of stars, each potentially hosting its own planetary system.
0.104 The double helix structure of DNA was discovered by Watson and Crick in 1953, revolutionizing biology.
0.100 Quantum mechanics explores the behavior of particles at the smallest scales, where classical physics no longer applies.
0.098 The principles of supply and demand are fundamental to understanding market economies.
0.067 Game theory is a mathematical framework used for analyzing strategic interactions between rational decision-makers.
0.053 In calculus, the derivative represents the rate of change of a function with respect to a variable.

Embedding models require much less computing power than LLMs, so we can run them locally. Let’s run an open-source embedding model. We can compare and choose models on Hugging Face.

from langchain_huggingface import HuggingFaceEmbeddings
hf_embedding_model = HuggingFaceEmbeddings(model_name='Snowflake/snowflake-arctic-embed-m')
embeddings = hf_embedding_model.embed_documents(sentences)
len(embeddings)
>>> 8
len(embeddings[0])
>>> 768

We can calculate the query embeddings and compute similarities with sentences as we have done above.
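For instance, here is a quick sketch reusing the Hugging Face model loaded above (at this point the embeddings variable holds the Hugging Face sentence embeddings):

# embed the query with the same Hugging Face model and rank sentences by similarity
hf_query_embedding = hf_embedding_model.embed_query(query)
hf_query_similarities = cosine_similarity(X=[hf_query_embedding], Y=embeddings)
for i in np.argsort(hf_query_similarities[0])[::-1]:
    print(format(hf_query_similarities[0][i], '.3f'), sentences[i])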

An important thing to note is that each embedding model is likely trained on different data, so the embeddings produced by different models likely lie in different vector spaces. If we embed the sentences and the query with different embedding models, the results can therefore be inaccurate even when the embedding dimensions are the same.
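To make this concrete, here is a small cautionary sketch: the OpenAI query embedding has 1536 dimensions while the Hugging Face sentence embeddings have 768, so comparing them fails outright; even if the dimensions happened to match, the scores would be meaningless because the vectors live in different spaces.

# a cautionary sketch: embeddings from different models must not be compared directly
openai_query_embedding = embedding_model.embed_query(query)              # 1536 dimensions
hf_sentence_embeddings = hf_embedding_model.embed_documents(sentences)   # 768 dimensions
try:
    cosine_similarity(X=[openai_query_embedding], Y=hf_sentence_embeddings)
except ValueError as err:
    print("Incompatible embedding spaces:", err)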

Using Vector Store

In the above example of finding similar sentences, we compared the query embedding with each sentence embedding. However, if we have to find relevant documents among millions, comparing the query embedding with every single one would take a lot of time. Instead, we can use a vector database, which relies on Approximate Nearest Neighbor (ANN) algorithms to retrieve the most relevant documents efficiently. To find out how those algorithms work, please refer to ANN Algorithms in Vector Databases.

Let us store the above example sentences in the vector store.

from langchain_chroma import Chroma
db = Chroma.from_texts(texts=sentences, embedding=embedding_model, persist_directory='./sentences_db',
                       collection_name='example', collection_metadata={"hnsw:space": "cosine"})        

Code explanation

  1. Chroma.from_texts: This creates a database from raw texts. Chroma.from_documents can be used with LangChain documents (see the sketch after this list).
  2. embedding: This is an embedding model loaded through LangChain.
  3. persist_directory: By adding a location, we can save the database to load it later and avoid computing embeddings again.
  4. collection_name: Name for the collection of documents we are storing. Make sure the directory name and collection name are different.
  5. collection_metadata: This specifies the distance metric used for comparing embeddings. Other options are ‘l2’ (L2 norm) and ‘ip’ (inner product).
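As a brief sketch of the Chroma.from_documents variant mentioned in point 1, we can build the store from the LangChain documents created earlier (the collection name here is just an illustrative choice):

# a minimal sketch: building the vector store from LangChain Documents instead of raw texts
docs_db = Chroma.from_documents(documents=documents, embedding=embedding_model,
                                collection_name='example_docs',
                                collection_metadata={"hnsw:space": "cosine"})
# metadata such as the 'source' category travels with each document into the store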

Some of the methods we can use on the database are as follows

# this will get the ids of the documents
ids = db.get()['ids']
# can be used to get other data about the documents
db.get(include=['embeddings', 'metadatas', 'documents'])
# this can be used to delete documents by id.
db._collection.delete(ids=ids)
# this is to delete the collection
db.delete_collection()

Now, we can search the database with the query and get the distance between the query embedding and sentences.

db.similarity_search_with_score(query=query, k=3)
# the above will get the documents with distance metric. 
# If we want similarity scores we can use
db.similarity_search_with_relevance_scores(query=query, k=3)
[Output: the three most similar documents, first with distance scores and then with relevance scores]

To get relevance scores between 0 and 1 for the ‘l2’ distance metric, we need to pass a relevance_score_fn:

db = Chroma.from_texts(texts=sentences, embedding=embedding_model, relevance_score_fn=lambda distance: 1.0 - distance / 2,
                       collection_name='sentences', collection_metadata={"hnsw:space": "l2"})

In the above code, we have only used the LangChain library. We can also directly use chromadb as follows:

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

embedding_function = OpenAIEmbeddingFunction(api_key=os.environ.get('OPENAI_API_KEY'), model_name='text-embedding-3-small')
# initialize the client
client = chromadb.PersistentClient('./sentence_db')
# create the collection if it doesn't exist; otherwise, load it
collection = client.get_or_create_collection('example', metadata={'hnsw:space': 'cosine'}, embedding_function=embedding_function)
# add sentences with any ids
collection.add(ids=ids, documents=sentences)

# now initialize the database collection to query
db = Chroma(client=client, collection_name='example', embedding_function=embedding_model)
# note that the embedding_function parameter here needs the embedding model loaded using langchain

Loading the vector database from a saved directory is as follows:

db2 = Chroma(persist_directory="./sentences_db/", collection_name='example', embedding_function=embedding_model)

# make sure you mention the collection name that you have used while creating the database.

# we can search this database as we have previously.

Sometimes, we may accidentally run the add-documents code again, adding the same documents to the database and creating unnecessary duplicates. We may also need to update some documents or delete outdated ones.

For all of this, we can use the Indexing API from LangChain.

Indexing

LangChain indexing uses a Record Manager to track document entries in the vector store. Therefore, we can’t use Indexing for an existing database, as the record manager doesn’t track the database’s existing content.

When content is indexed, hashes are computed for each document, and the following information is stored in the Record Manager.

  • Document Hash: A hash of both the page content and its metadata.
  • Write Time: The timestamp when the document was written.
  • Source ID: Metadata that includes information to identify the ultimate source of the document.

Using these, we can avoid re-writing and re-computing embeddings for unchanged content. There are three modes of indexing: None, Incremental, and Full:

  1. None mode avoids writing duplicate content to the vector store and doesn’t update or delete anything.
  2. Incremental mode updates the database with new content and deletes old content for a given source.
  3. Full mode deletes any content not found in the currently indexed content.

The categories we added as sources in the metadata when creating documents will be useful here.

from langchain.indexes import SQLRecordManager, index

# initialize the database
db = Chroma(persist_directory='./sentence_db', collection_name='example', 
            embedding_function=embedding_model, collection_metadata={"hnsw:space": "cosine"})
# choose a namespace that identifies the database and collection
namespace = "db/example"

record_manager = SQLRecordManager(namespace, db_url="sqlite:///record_manager.sql")
record_manager.create_schema()

# load and index the documents 
index(docs_source=documents, record_manager=record_manager, vector_store=db, cleanup=None)
>>> {'num_added': 8, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

If we run the last line of code again, we will see num_skipped as 8 and all others as 0.
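That second run would look roughly as follows (a sketch of the expected counts based on the behaviour described above):

# re-running the same indexing call: the record manager recognizes the hashes and skips everything
index(docs_source=documents, record_manager=record_manager, vector_store=db, cleanup=None)
>>> {'num_added': 0, 'num_updated': 0, 'num_skipped': 8, 'num_deleted': 0}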

doc1 = Document(page_content="The human immune system protects the body from infections by identifying and destroying pathogens", 
                metadata={"source": "Biology"})
doc2 = Document(page_content="Genetic mutations can lead to variations in traits, which may be beneficial, neutral, or harmful", 
                metadata={"source": "Biology"})
# add these docs to the database in incremental mode
index(docs_source=[doc1, doc2], record_manager=record_manager, vector_store=db, cleanup='incremental', source_id_key='source')
>>> {'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

With the above code, both documents are added, and none are deleted, because the previous 8 documents were not indexed in incremental mode. If the previous 8 had been indexed in incremental mode with the same source_id_key, the two existing ‘Biology’ documents would be deleted, and these 2 would be added.

If we slightly change doc2 and rerun the indexing code, doc1 will be skipped as it is not changed, the existing doc2 will be deleted from the database, and the changed doc2 will be added. This is denoted as {'num_added': 1, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 1}.
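A sketch of that update might look as follows; the modified wording of doc2 is hypothetical and only meant to trigger a content change:

# a hypothetical edit to doc2, re-indexed in incremental mode
doc2 = Document(page_content="Genetic mutations can lead to variations in traits, which may be beneficial, neutral, or harmful to an organism", 
                metadata={"source": "Biology"})
index(docs_source=[doc1, doc2], record_manager=record_manager, vector_store=db, cleanup='incremental', source_id_key='source')
>>> {'num_added': 1, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 1}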

In the full mode, if we don’t mention any docs while indexing, all the existing docs will be deleted.

index([], record_manager=record_manager, vector_store=db, cleanup="full", source_id_key="source")
>>> {'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 10}
# this will delete all the documents in database

So, by using different modes, we can efficiently manage what data to keep and what to update and delete. Practice this indexing with different combinations of modes to understand it better.

Conclusion

Here, we have demonstrated how to efficiently find content similar to a query using vector embeddings with LangChain. We can achieve accurate and scalable content retrieval by leveraging embedding models and vector databases. Additionally, LangChain’s indexing capabilities allow for effective management of document updates and deletions, ensuring optimal database performance and relevance.

In the next article, we will discuss different ways of retrieving the content to send to the LLM.

Frequently Asked Questions

Q1. What are text embeddings, and why are they important?

Ans. Text embeddings are numerical representations of text that capture semantic meanings. They are important because they allow us to perform computations with text, such as finding similar sentences or words based on their meanings.

Q2. How does LangChain help create and use embeddings?

Ans. LangChain provides tools to initialize embedding models and compute embeddings for sentences. It also facilitates comparing these embeddings using similarity measures like cosine similarity, enabling efficient content retrieval.

Q3. What is the role of vector databases in content retrieval?

Ans. Vector databases store embeddings and use Approximate Nearest Neighbors (ANN) algorithms to quickly find relevant documents from a large dataset. This makes the retrieval process much faster and scalable.

Q4. How does LangChain’s indexing feature improve database management?

Ans. LangChain’s indexing feature uses a Record Manager to track document entries and manage updates and deletions. It offers different modes (None, Incremental, Full) to handle duplicate content, updates, and clean-ups efficiently, ensuring the database remains accurate and up-to-date.

Santosh 05 Aug, 2024

I am working as an Associate Data Scientist at Analytics Vidhya, a platform dedicated to building the Data Science ecosystem. My interests lie in the fields of Deep Learning and Natural Language Processing (NLP).
