The introduction of tools like LangChain and LangFlow has made it easier to build applications with Large Language Models. Yet while choosing an LLM and assembling an application has become simpler, the data-ingestion step remains time-consuming for developers: data arrives from many different sources, and it must be converted into plain text before it can be injected into a vector store. This is where Embedchain comes in, making it simple to upload data of almost any type and start querying the LLM instantly. In this article, we will explore how to get started with Embedchain.
Embedchain is a Python/JavaScript library that lets a developer connect multiple data sources to Large Language Models seamlessly. Embedchain allows us to upload, index, and retrieve unstructured data, which can be of any type: plain text, a URL to a website or YouTube video, an image, and so on.
Embedchain makes it simple to upload this unstructured data with a single command, creating vector embeddings for it so that we can start querying the connected LLM instantly. Behind the scenes, Embedchain takes care of loading the data from its source, chunking it, creating vector embeddings for the chunks, and finally storing them in a vector store.
In this section, we will install the embedchain package and create an app with it. The first step is to install the package with pip, as shown below:
!pip install embedchain
!pip install embedchain[huggingface-hub]
Now we will create an environment variable to store the Hugging Face Inference API token, as shown below. We can obtain this token by signing in to the Hugging Face website and generating a new token.
import os
os.environ["HUGGINGFACE_ACCESS_TOKEN"] = "Hugging Face Inferenece API Token"
The embedchain library will use the token provided above to call the Hugging Face models. Next, we must create a YAML file defining the model we want to use from Hugging Face. A YAML file can be thought of as a simple key-value store in which we define the configuration for our LLM application, such as which LLM and which embedding model to use. Below is an example YAML file:
config = """
llm:
provider: huggingface
config:
model: 'google/flan-t5-xxl'
temperature: 0.7
max_tokens: 1000
top_p: 0.8
embedder:
provider: huggingface
config:
model: 'sentence-transformers/all-mpnet-base-v2'
"""
with open('huggingface_model.yaml', 'w') as file:
file.write(config)
Next, we will create an app with the above YAML configuration file.
from embedchain import Pipeline as App
app = App.from_config(yaml_path="huggingface_model.yaml")
app.add("https://en.wikipedia.org/wiki/Alphabet_Inc.")
Let’s query our App based on the uploaded data:
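A minimal sketch of such queries follows; the original article showed them as a screenshot, so the two questions below are illustrative examples about the added Wikipedia page:

# Ask the app two questions about the added data; the relevant chunks are
# retrieved from the vector store and passed to the flan-t5 model.
print(app.query("Who founded Alphabet Inc.?"))
print(app.query("What companies does Alphabet Inc. own?"))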
Using the query() method, we asked our App, i.e. the flan-t5 model, two questions related to the data that was added, and the model answered them correctly. In the same way, we can add multiple data sources to the model by passing them to the add() method; internally they are processed, embeddings are created for them, and they are added to the vector store, after which we can query the data with the query() method, as the sketch below shows.
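A minimal sketch of adding further sources to the same app; the URLs here are just examples:

# Each add() call loads, chunks, embeds, and stores the new data in the
# vector store alongside the data added earlier.
app.add("https://en.wikipedia.org/wiki/Google")
app.add("https://en.wikipedia.org/wiki/Sundar_Pichai")

# Queries now draw on all of the added sources.
print(app.query("Who is the CEO of Alphabet Inc.?"))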
In the previous example, we saw how to prepare an application that adds a website as the data source and a Hugging Face model as the underlying Large Language Model for the App. In this section, we will use a different model and a different vector database to see how flexible embedchain can be. For this example, we will use Zilliz Cloud as our vector database, so we need to install the respective Python client as shown below:
!pip install --upgrade embedchain[milvus]
!pip install pytube
After creating a cluster in Zilliz Cloud, we can obtain the credentials needed to connect to it from the cluster details page.
Copy the Public Endpoint and the Token and store them somewhere safe, as they will be needed to connect to the Zilliz Cloud vector store. For the Large Language Model, this time we will use the OpenAI GPT model, so we will also need an OpenAI API key. After obtaining all the keys, create the environment variables as shown below:
os.environ["OPENAI_API_KEY"]="Your OpenAI API Key"
os.environ["ZILLIZ_CLOUD_TOKEN"]= "Your Zilliz Cloud Token"
os.environ["ZILLIZ_CLOUD_URI"]= "Your Zilliz Cloud Public Endpoint"
The above stores all the credentials required by Zilliz Cloud and OpenAI as environment variables. Now it's time to define our app, which can be done as follows:
from embedchain.vectordb.zilliz import ZillizVectorDB

# Create the app with Zilliz Cloud as the vector store; the credentials are
# read from the environment variables set above.
app = App(db=ZillizVectorDB())
app.add("https://www.youtube.com/watch?v=ZnEgvGPMRXA")
Now the video is first converted to text, then split into chunks, which are converted into vector embeddings by the OpenAI embedding model. These embeddings are then stored inside Zilliz Cloud. If we go to Zilliz Cloud and look inside our cluster, we can find a new collection named "embedchain_store", where all the data that we add to our app is stored.
A new collection was created under the name "embedchain_store", and this collection contains the data that we added in the previous step. Now we will query our app.
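Again, the original query was shown as a screenshot, so the question below is an illustrative sketch:

# Ask a question that is answered in the video; retrieval is backed by the
# Zilliz Cloud store and generation by the OpenAI GPT model.
answer = app.query("What are the new features in the Windows 11 update?")
print(answer)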
The video that was added to the app is about a new Windows 11 update. When we ask the app a question that was covered in the video, it answers correctly. In these two examples, we have seen how to use different Large Language Models and different databases with embedchain, and we have uploaded data of different types, i.e. a webpage and a YouTube video.
Embedchain has been growing steadily since its release, adding support for a large variety of Large Language Models, including the Hugging Face and OpenAI models used in this article; the full list is available in the official documentation.
Apart from supporting a wide range of Large Language Models, embedchain also supports many vector databases, among them chromadb and the Zilliz/Milvus store used above; the complete list can likewise be found in the documentation.
Beyond these, embedchain plans to add support for more Large Language Models and vector databases in the future.
While building applications with Large Language Models, the main challenge is dealing with data that comes from many different sources. All of it eventually needs to be converted into a single format before being turned into embeddings, and every source has its own handling: separate libraries exist for videos, others for websites, and so on. In this article, we have looked at a solution to this challenge, the Embedchain Python package, which does all the heavy lifting for us and lets us integrate data from any source without worrying about the underlying conversion.
Some of the key takeaways from this article include:
- Embedchain lets us add data from many sources (websites, YouTube videos, documents) to an LLM application with a single add() call.
- Behind the scenes, embedchain loads, chunks, and embeds the data and stores the embeddings in a vector store.
- Both the LLM and the vector database are configurable, either through a config.yaml file or directly in code.
- We combined Hugging Face and OpenAI models with different vector stores, including Zilliz Cloud, showing how flexible embedchain is.
Q. What is Embedchain?
A. Embedchain is a Python tool that allows users to add data of any type and get it stored in a vector store, thus allowing us to query it with any Large Language Model.
Q. How do we use a vector database of our choice with Embedchain?
A. A vector database of our choice can be given to the app we are developing, either through the config.yaml file or directly to the App() class by passing the database to the "db" parameter of App(), as the sketch below shows.
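A minimal sketch of the config.yaml route, assuming a local chromadb store; the collection name, directory, and file name are illustrative:

# Write a hypothetical config.yaml that selects a chromadb vector store.
config = """
vectordb:
  provider: chroma
  config:
    collection_name: 'my-collection'
    dir: db
"""
with open('chroma_config.yaml', 'w') as file:
    file.write(config)

# Build the app from this configuration file.
app = App.from_config(yaml_path="chroma_config.yaml")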
Q. Is the data we add stored locally?
A. Yes, when using a local vector database like chromadb: when we call the add() method, the data is converted into vector embeddings and stored in the vector database, which is persisted locally under the folder "db".
Q. Is a config.yaml file mandatory?
A. No, it is not. We can configure our application by passing the configuration directly to the App() class, or instead use a config.yaml file to generate the App from. A config.yaml file is useful for replicating results or for sharing the configuration of our application with someone else, but it is not mandatory.
Q. What data sources does Embedchain support?
A. Embedchain supports data coming from different sources, including CSV, JSON, Notion, mdx files, docx, web pages, YouTube videos, PDFs, and many more. It abstracts away how each of these sources is handled, making it easy to add any data, as the sketch below illustrates.
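A minimal sketch, assuming the URLs and the file path below exist (they are hypothetical):

# The data type is usually inferred from the source, but it can also be
# passed explicitly via the data_type parameter.
app.add("https://www.example.com/article")            # web page
app.add("https://www.youtube.com/watch?v=VIDEO_ID")   # YouTube video
app.add("data/report.pdf", data_type="pdf_file")      # local PDF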
To learn more about embedchain and its architecture, please refer to the official documentation page and GitHub repository.