Introduction to Embedchain – A Data Platform Tailored for LLMs

Ajay Last Updated : 08 Nov, 2023

8 min read

Introduction

The introduction to tools like LangChain, and LangFlow, has made things easier when building applications with Large Language Models. Though building applications and choosing different Large Language Models has become easier, the data uploading part, where the data comes from various sources is still time-consuming for developers while developing LLM-powered applications as the developers need to convert data from these various sources into plain text before injecting them into vector stores. This is where Embedchain comes in, which makes it simple to upload data of any data type and start querying the LLM instantly. In this article, we will explore how to get started with embedchain.

Learning Objectives

Understanding the significance of embedchain in simplifying the process of managing and querying data for Large Language Models (LLMs)
Learn how to effectively integrate and upload unstructured data into embedchain, enabling developers to work with various data sources seamlessly
Knowing the different Large Language Models and Vector Stores supported by embedchain
Discover how to add various data sources, such as web pages and videos to the vector store, thus understanding the data ingestion

This article was published as a part of the Data Science Blogathon.

What is Embedchain?
Creating First App with Embedchain
Configuring App with a Different Model and Vector Store
Supported LLMs and Vector Stores by Embedchain
Frequently Asked Questions

What is Embedchain?

Embedchain is a Python/Javascript library, with which a developer can connect multiple data sources with Large Language Models seamlessly. Embedchain allows us to upload, index, and retrieve unstructured data. The unstructured data can be of any type like a text, a URL to a website/YouTube video, an Image, etc.

Emdechain makes it simple to upload these unstructured data with a single command, thus creating vector embeddings for them and starting querying instantly with the data with the connected LLM. Behind the scenes, embedchain takes care of loading the data from the source, chunking it, then creating vector embeddings for it, and finally storing them in a vector store.

Creating First App with Embedchain

In this section, we will install the embedchain package and create an app with it. The first step would be using the pip command to install the package as shown below:

!pip install embedchain

!pip install embedchain[huggingface-hub]

The first statement will install the embedchain Python Package
The next line will install the huggingface-hub, this Python Package is required if we want to use any models provided by the hugging-face

Now we will be creating an environment variable to store the Hugging Face Inference API Token as below. We can obtain the Inference API Token by signing in to the Hugging Face website and then generating a token.

import os

os.environ["HUGGINGFACE_ACCESS_TOKEN"] = "Hugging Face Inferenece API Token"

The embedchain library will use the token provided above to infer the hugging face models. Next, we must create a YAML file defining the model we want to use from huggingface. A YAML file can be considered as a simple key-value store where we define the configurations for our LLM applications. These configurations can include what LLM model we are going to use or what Embedding Model we are going to use(To learn more about the YAML file please click here). Below is an example YAML file

config = """
llm:
  provider: huggingface
  config:
    model: 'google/flan-t5-xxl'
    temperature: 0.7
    max_tokens: 1000
    top_p: 0.8


embedder:
  provider: huggingface
  config:
    model: 'sentence-transformers/all-mpnet-base-v2'
"""


with open('huggingface_model.yaml', 'w') as file:
    file.write(config)

We are creating a YAML file from Python itself and storing it in the file named huggingface_model.yaml.
In this YAML file, we define our model parameters and even the embedding model being used.
In the above, we have specified the provider as huggingface and flan-t5 model with different configurations/parameters that include the temperature of the model, the max_tokens(i.e. The output length), and even the top_p value.
For the embedding model, we are using a popular embedding model from huggingface called the all-mpnet-base-v2, which will be responsible for creating embedding vectors for our model.

YAML Configuration

Next, we will create an app with the above YAML configuration file.

from embedchain import Pipeline as App

app = App.from_config(yaml_path="huggingface_model.yaml")

Here we import the Pipeline object as an App from the embedchain. The Pipeline object is responsible for creating LLM Apps taking in different configurations as we have defined above.
The App will create an LLM with the models specified in the YAML file. To this app, we can feed in data from different data sources, and to the same App, we can call in the query method to query the LLM on the data provided.
Now, let’s add some data.

app.add("https://en.wikipedia.org/wiki/Alphabet_Inc.")

The app.add() method will take in data and add it to the vector store.
Embedchain takes care of collecting the data from the web page, creating it into chunks, and then creating the embeddings for the data.
The data will then be stored in a vector database. The default database used in embedchain is chromadb.
In this example, we are adding the Wikipedia page of Alphabet, the parent of Google to the App.

Let’s query our App based on the uploaded data:

In the above Image, using the query() method, we have asked our App i.e. the flan-t5 model two questions related to the data that was added to the App. The model was able to answer them correctly. This way, we can add multiple data sources to the model by passing them to the add() method and internally they will be processed and the embeddings will be created for them, and finally will be added to the vector store. Then we can query the data with the query() method.

Configuring App with a Different Model and Vector Store

In the previous example, we have seen how to prepare an application that adds a website as the data and the Hugging Face Model as the underlying Large Language Model for the App. In this section, we will see how we can use other models and other vector databases to see how flexible the embedchain can be. For this example, we will be using Zilliz Cloud as our Vector Database, hence we need to download the respective Python client as shown below:

!pip install --upgrade embedchain[milvus]

!pip install pytube

The above will download the Pymilvus Python package with which we can interact with Zilliz Cloud.
The pytube library will let us convert YouTube videos to text so that they can be stored in the Vector Store.
Next, we can create a free account with the Zilliz Cloud. After creating the free account, go to the Zilliz Cloud Dashboard and create a Cluster.

After creating the Cluster we can obtain the credentials to connect to it as shown below:

OpenAI API Key

Copy the Public Endpoint and the Token and store these somewhere else, as these will be needed to connect to the Zilliz Cloud Vector Store. And now for the Large Language Model, this time we will use the OpenAI GPT model. So we will also need the OpenAI API Key to move forward. After obtaining all keys, create the environment variables as shown below:

os.environ["OPENAI_API_KEY"]="Your OpenAI API Key"

os.environ["ZILLIZ_CLOUD_TOKEN"]= "Your Zilliz Cloud Token"

os.environ["ZILLIZ_CLOUD_URI"]= "Your Zilliz Cloud Public Endpoint"

The above will store all the required credentials to the Zilliz Cloud and OpenAI as environment variables. Now it’s the time to define our app, which can be done as follows:

from embedchain.vectordb.zilliz import ZillizVectorDB

app = App(db=ZillizVectorDB())

app.add("https://www.youtube.com/watch?v=ZnEgvGPMRXA")

Here first we import the ZillizVectorDB class provided by the embedchain.
Then when creating our new app, we will pass the ZillizVectorDB() to the db variable inside the App() function.
As we have not specified any LLM, the default LLM is chosen as OpenAI GPT 3.5.
Now our app is defined with OpenAI as LLM and Zilliz as the Vector Store.
Next, we are adding a YouTube video to our app using the add() method.
Adding a YouTube video is as simple as passing the URL to add() function, all the video-to-text conversion is abstracted away by the embedchain, thus making it simple.

Zilliz Cloud

Now, the video is first converted to text, next it will be created into chunks and will be converted into vector embeddings by the OpenAI embedding model. These embeddings will then be stored inside the Zilliz Cloud. If we go to the Zilliz Cloud and check inside our cluster, we can find a new collected named “embedchain_store”, where all the data that we add to our app is stored:

As we can see, a new collection was created under the name “embedchain_store” and this collection contains the data that we have added in the previous step. Now we will query our app.

The video that was added to the app is about the new Windows 11 update. In the above image, we ask the app a question that was mentioned in the video. And the app correctly answers the question. In these two examples, we have seen how to use different Large Language Models and different databases with embedchain and have also uploaded data of different types, i.e. a webpage and a YouTube video.

Supported LLMs and Vector Stores by Embedchain

Embedchain has been growing a lot since it was released by bringing in support for a large variety of Large Language Models and Vector Databases. The supported Large Language Models can be seen below:

Hugging Face Models
OpenAI
Azure OpenAI
Anthropic
Llama2
Cohere
JinaChat
Vertex AI
GPT4All

Apart from supporting a wide range of Large Language Models, the embedchain also provides support to many vector databases that can seen in the below list:

ChromaDB
ElasticSearch
OpenSearch
Zilliz
Pinecone
Weaviate
Qdrant
LanceDB

Apart from these, the embedchain in the future will be adding support for more Large Language Models and Vector Databases.

Conclusion

While building applications with large language models, the main challenge will be when dealing with data, that is dealing with data coming from different data sources. All the data sources eventually need to be converted into a single type before being converted into embeddings. And every data source has its own way of handling it like there exists separate libraries for handling videos, others for handling websites, and so on. So, we have taken a look at a solution for this challenge with the Embedchain Python Package, which does all the heavy lifting for us, thus allowing us to integrate data from any data source without worrying about the underlying conversion.

Key Takeaways

Some of the key takeaways from this article include:

Embedchain supports a large set of Large Language Models, thus allowing us to work with any of them.
Also, Embedchain integrates with many popular Vector Stores.
A simple add() method can be used to store data of any type in the vector store.
Embedchain makes it easier to switch between LLMs and Vector DBs and provides simple methods to add and query the data.

Frequently Asked Questions

Q1. What is Embedchain?

A. Embedchain is a Python tool that allows users to add in data of any type and get it stored in a Vector Store thus allowing us to query it with any Large Language Model.

Q2. How do we use different Vector Stores in Embedchain?

A. A vector database of our choice can be given to the app we are developing either through the config.yaml file or directly to the App() class by passing the database to the “db” parameter inside the App() class.

Q3. Will the data be persisted locally?

A. Yes, in the case of using local vector databases like chromadb, when we perform an add() method, the data will be converted into vector embeddings and then be stored in a vector database like chromadb which will be persisted locally under the folder “db”.

Q4. Is it necessary to create a config.yaml for working with different Databases / LLMs?

A. No, it is not. We can configure our application by directly passing the configurations to the App() variables or instead use a config.yaml to generate an App from the YAML file. Config.yaml file will be useful to replicate the results / when we want to share the configuration of our application with someone else but it is not mandatory to use one.

Q5. What are the supported data sources by Embedchain?

A. Embedchain supports data coming from different data sources which include CSV, JSON, Notion, mdx files, docx, web pages, YouTube videos, pdfs, and many more. Embedchain abstracts away the way it handles all these data sources thus making it easier for us to add any data.

References

To learn more about the embedchain and its architecture please refer to their official documentation page and Github Repository.

https://docs.embedchain.ai
https://github.com/embedchain/embedchain

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Ajay

I work as a Developer in the field of Data Science. I constantly spend time learning new things be it related to AI, DataSceine, and CyberSecurity. Deep learning and machine learning are two topics that I find particularly fascinating, and Python is my preferred language for programming. Cyber Security is another field that I'm touching upon recently. I have experience with large-scale data analysis, and I have a solid grasp of a variety of deep learning and machine learning approaches, including neural networks, regression models, and natural language processing. I'm eager to take on new challenges and make a meaningful contribution to the industry, so I'm constantly seeking for ways to enlarge and deepen my knowledge and skills in the subject.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Introduction to Generative AI

Introduction to Generative AI applications

No-code Generative AI app development

Code-focused Generative AI App Development

Introduction to Responsible AI

LLMS

Prompt Engineering

Finetuning LLMs

Training LLMs from Scratch

Langchain

RAG

LlamaIndex

Stable Diffusion

Introduction to Embedchain – A Data Platform Tailored for LLMs

Introduction

Learning Objectives

Table of contents

What is Embedchain?

Creating First App with Embedchain

YAML Configuration

Configuring App with a Different Model and Vector Store

OpenAI API Key

Zilliz Cloud

Supported LLMs and Vector Stores by Embedchain

Conclusion

Key Takeaways

Frequently Asked Questions

References

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

s_pltp

s_tslv