With the rise of AI applications and use cases, a wave of tools and technologies has emerged to help developers build real-world applications. Among these tools, today we will learn about the workings and functions of ChromaDB, an open-source vector database for storing embeddings from AI models such as GPT-3.5, GPT-4, or any other open-source model. Embeddings are a crucial component of any AI application pipeline: since machine learning models operate on numeric vectors, all data must be vectorized into embeddings before it can power semantic search applications.
So let’s dive deeper into the workings of ChromaDB with hands-on code examples!
This article was published as a part of the Data Science Blogathon.
ChromaDB is an open-source vector database designed to store vector embeddings to develop and build large language model applications. The database makes it simpler to store knowledge, skills, and facts for LLM applications.
The above diagram shows how ChromaDB works when integrated with an LLM application. ChromaDB gives us tools to store embeddings and their metadata, embed documents and queries, and search over the stored embeddings.
ChromaDB is simple to set up and use with any LLM-powered application. It is designed to boost developer productivity, making it a developer-friendly tool.
Now, let’s install ChromaDB in the Python and JavaScript environments. It also runs in a Jupyter Notebook, allowing data scientists and machine learning engineers to experiment with LLM models.
Python Installation
# install chromadb in the Python environment
pip install chromadb
Javascript Installation
# install chromadb in JS environment
npm install --save chromadb # yarn add chromadb
After installing the library, we will walk through its various functions in the next sections.
We can use a Jupyter Notebook environment such as Google Colab for demo purposes. You can follow the hands-on exercises in a Google Colab, Kaggle, or local notebook environment.
# import chromadb and create a client
import chromadb
client = chromadb.Client()
collection = client.create_collection("my-collection")
In the above code, we instantiated the client object and used it to create a collection named “my-collection”. Note that chromadb.Client() runs in-memory by default, so the data lasts only for the lifetime of the process.
The collection is where embeddings, documents, and any additional metadata are stored to query later for various applications.
# add the documents in the db
collection.add(
    documents=["This is a document about cat", "This is a document about car",
               "This is a document about bike"],
    metadatas=[{"category": "animal"}, {"category": "vehicle"},
               {"category": "vehicle"}],
    ids=["id1", "id2", "id3"]
)
Now, we have added a few sample documents, along with metadata and IDs, to store them in a structured manner.
ChromaDB will store the text documents and handle tokenization, vectorization, and indexing automatically without any extra commands.
# query the collection to retrieve the most similar document
results = collection.query(
    query_texts=["vehicle"],
    n_results=1
)
------------------------------[Results]-------------------------------------
{'ids': [['id2']],
'embeddings': None,
'documents': [['This is a document about car']],
'metadatas': [[{'category': 'vehicle'}]],
'distances': [[0.8069301247596741]]}
By simply calling the query() function on the collection, ChromaDB returns the most similar documents to the input query, along with their metadata and IDs. In our example, the query “vehicle” returns the car document, whose metadata category is “vehicle”.
Semantic search is one of the most popular applications in the technology industry, powering web search at companies such as Google and Baidu. Language models now make it possible to build such applications at an individual or organizational level by embedding large amounts of data.
We will use a “pets” folder with a few sample documents to build a semantic search application in ChromaDB. We have the following files in a local folder:
Let’s import files from the local folder and store them in “file_data”.
# import files from the pets folder to store in the vector DB
import os

def read_files_from_folder(folder_path):
    file_data = []
    for file_name in os.listdir(folder_path):
        if file_name.endswith(".txt"):
            with open(os.path.join(folder_path, file_name), 'r') as file:
                content = file.read()
                file_data.append({"file_name": file_name, "content": content})
    return file_data

folder_path = "/content/pets"
file_data = read_files_from_folder(folder_path)
The above code reads the files from the “pets” folder and appends them to “file_data” as a list of dictionaries. We will store these files in ChromaDB as embeddings for querying purposes.
# get the data from file_data and create a chromadb collection
documents = []
metadatas = []
ids = []

for index, data in enumerate(file_data):
    documents.append(data['content'])
    metadatas.append({'source': data['file_name']})
    ids.append(str(index + 1))

# create a collection of pet files
pet_collection = client.create_collection("pet_collection")

# add files to the chromadb collection
pet_collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)
The above code takes the documents and metadata from the list of files and adds them to the ChromaDB collection called “pet_collection”.
Note that, by default, ChromaDB uses the “all-MiniLM-L6-v2” embedding model from Sentence Transformers to convert text documents into vectors. Now, let’s query the collection to see the results.
# query the database to get the answer from the vectorized data
results = pet_collection.query(
    query_texts=["What is the Nutrition needs of the pet animals?"],
    n_results=1
)
results
As we query the collection, it automatically finds the most similar document to our query among the embedded documents and returns it as the output. We can also see the distance metric in the output, which shows how close a given document is to our query.
So far, we have used the default embedding model to vectorize the input texts, but ChromaDB also supports other models from the Sentence Transformers library. We will use the “paraphrase-MiniLM-L3-v2” model to embed the same pet documents for our semantic search application.
(Note: please install the sentence_transformers library before executing the code below, if you haven’t already.)
# import the sentence transformers library and load a model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L3-v2')

documents = []
embeddings = []
metadatas = []
ids = []

# enumerate through file_data to collect each document, embedding, and metadata
for index, data in enumerate(file_data):
    documents.append(data['content'])
    embedding = model.encode(data['content']).tolist()
    embeddings.append(embedding)
    metadatas.append({'source': data['file_name']})
    ids.append(str(index + 1))
# create the new chromaDB collection and use the embeddings to add data
pet_collection_emb = client.create_collection("pet_collection_emb")

# add the pet files into the pet_collection_emb collection
pet_collection_emb.add(
    documents=documents,
    embeddings=embeddings,
    metadatas=metadatas,
    ids=ids
)
The above code uses the “paraphrase-MiniLM-L3-v2” model to encode the input files while adding them to the new collection.
Now, we can query the database again to get the most similar results.
# encode the text query with the same model and submit it to the collection
query = "What are the different kinds of pets people commonly own?"
input_em = model.encode(query).tolist()

results = pet_collection_emb.query(
    query_embeddings=[input_em],
    n_results=1
)
results
Embeddings are the native way to store all kinds of data for AI applications. They can represent text, images, audio, and video, depending on the requirements of the application.
ChromaDB supports many embedding providers, such as OpenAI, Sentence Transformers, Cohere, and the Google PaLM API. Let’s look at some of them here.
# load any model from the sentence transformer library
from chromadb.utils import embedding_functions

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
Using the above code, we can load any of the available models. You can find the list of models here.
ChromaDB also provides a wrapper function to use any embedding model API from OpenAI in AI applications.
# function to call OpenAI embeddings
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_API_KEY",
    model_name="text-embedding-ada-002"
)
For more detailed information on ChromaDB functions, please visit their official documentation here.
Github code repository: Click Here
In conclusion, vector databases are key building blocks for generative AI applications. ChromaDB is one such vector database, increasingly used in a wide range of LLM-based applications. In this blog, we learned about ChromaDB’s various functions and workings using code examples.
A. ChromaDB is an AI-native, open-source database designed for LLM-based applications, making knowledge and skills pluggable for LLMs.
A. Yes, ChromaDB is free to use for any personal or commercial purpose under the Apache 2.0 license.
A. ChromaDB is flexible by nature. It works in-memory as well as in a persistent, embedded configuration for any LLM-based application.
A. ChromaDB is a vector database that stores data in embedding form, while LangChain is a framework for building LLM applications, which can use a vector database such as ChromaDB as a component.
A. ChromaDB supports Sentence Transformers models, OpenAI APIs, Cohere, and other open-source embedding models.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.