In this article, we will explore a new tool, EmbedAnything, and see how it works and what you can use it for. EmbedAnything is a high-performance library that lets you create image and text embeddings directly from files, using either local or cloud-based embedding models. We will also see it in action with an example where we classify fashion images based on the apparel they show.
If you work in the AI space or have worked with large language models (LLMs), you have definitely come across the term embeddings. In simple words, an embedding is a compressed representation of a word or a sentence: a vector of floating-point numbers whose size can range from around 100 dimensions to as many as 5,000.
Embedding models have evolved a lot over the years. The earliest models were based on one-hot encoding or word occurrences. However, with new technological developments and greater data availability, embedding models have become much more powerful.
The simplest way to represent a word as an embedding is a one-hot encoding over the total vocabulary of the text corpus. However, this is extremely inefficient: the representation is very sparse, and the size of each embedding equals the vocabulary size, which can run into the millions.
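As a toy illustration of why this is wasteful (a minimal sketch with a made-up five-word vocabulary), a one-hot embedding looks like this:

import numpy as np

# Toy vocabulary of five words; real corpora can have millions of entries
vocab = ["cat", "dog", "shirt", "coat", "shoe"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A 1 at the word's index and 0 everywhere else -- very sparse
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("shirt"))  # [0. 0. 1. 0. 0.]

Every vector has exactly one non-zero entry, and its length grows with the vocabulary, which is what makes this representation impractical at scale.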
The next approach is Word2Vec, which uses a simple, shallow, fully connected neural network. There are two training methods: Skip-gram and Continuous Bag of Words (CBOW). CBOW predicts the target word from its context, while Skip-gram predicts the context words from the target word. Both methods are very efficient.
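To make this concrete, here is a rough sketch using the gensim library and a made-up toy corpus; in practice you would train on millions of tokenized sentences or load pre-trained vectors.

from gensim.models import Word2Vec

# Tiny tokenized toy corpus, purely for illustration
sentences = [
    ["the", "coat", "keeps", "you", "warm"],
    ["the", "shirt", "is", "light", "and", "comfortable"],
    ["jeans", "and", "a", "shirt", "make", "a", "casual", "outfit"],
]

# sg=1 trains a Skip-gram model; sg=0 would train CBOW instead
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

vector = model.wv["shirt"]                            # dense 50-dimensional embedding
neighbours = model.wv.most_similar("shirt", topn=3)   # nearest words in the embedding space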
Another approach is GloVe (Global Vectors for Word Representation). GloVe focuses on leveraging the statistical information of word co-occurrence across a large corpus.
One of the earliest ways to build contextual embeddings with transformers was BERT. What is BERT? It is a model trained in a self-supervised way to predict masked words: if we [MASK] a word in a sentence, the model tries to predict what that word could be, so information flows both left-to-right and right-to-left around the masked word.
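For instance, with the Hugging Face transformers library (a minimal sketch; the sentence is just an example), you can ask a pre-trained BERT to fill in a masked word:

from transformers import pipeline

# BERT uses context on both sides of [MASK] to rank candidate words
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("She wore a warm [MASK] in the winter.")

for prediction in predictions[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))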
What we have seen so far are ways to create word embeddings. But in many cases, we want to capture a representation of a sentence instead of just the words in the sentence. There are several ways to create sentence embeddings from word embeddings. One of the most prevalent methods is using pre-trained models like SBERT (Sentence BERT). These are trained by creating a dataset of similar pairs of sentences and performing contrastive learning with similarity scores.
There are several ways to use sentence embedding models. The easiest is to call cloud-based embedding APIs like OpenAI, Jina, or Cohere. There are also several local models on Hugging Face, such as all-MiniLM-L6-v2.
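As a minimal sketch of the local route, using the sentence-transformers library with the all-MiniLM-L6-v2 model:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A man is wearing a blue coat.", "Someone has a navy jacket on."]
embeddings = model.encode(sentences)  # one 384-dimensional vector per sentence

# Cosine similarity between the two sentence embeddings
print(float(util.cos_sim(embeddings[0], embeddings[1])))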
Multimodal embeddings are vector representations that encode information from multiple types of data into a common vector space. This allows models to understand and correlate information across different modalities. One of the most used multimodal embedding models is CLIP, which embeds text and images in a shared embedding space.
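As a brief sketch of this shared space using the Hugging Face transformers CLIP implementation (the image path below is just a placeholder):

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("coat.jpg")  # placeholder image path
inputs = processor(text=["a photo of a coat"], images=image,
                   return_tensors="pt", padding=True)

# Text and image are projected into the same vector space,
# so the two embeddings can be compared directly (e.g. with cosine similarity)
text_embedding = model.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
image_embedding = model.get_image_features(pixel_values=inputs["pixel_values"])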
Embeddings have a lot of applications across various industries, including semantic search, retrieval-augmented generation (RAG), recommendation systems, clustering, and classification.
AI models are not easy to run: they are computationally intensive, hard to deploy, and hard to monitor. EmbedAnything lets you run embedding models efficiently and makes them deployment-friendly.
Let us look at some of the benefits of using EmbedAnything in detail:
While cloud-based embedding services like OpenAI, Jina, and Mistral offer convenience, many users need the flexibility and control of local embedding models, for example to keep data private, control costs, or run offline.
EmbedAnything is built with Rust. This makes it faster and provides type safety and a much better development experience. But why is speed so crucial in this process?
Creating embeddings from files involves two steps that demand significant computational power: extracting the text (or images) from the files, and running inference with the embedding model.
By using Rust for its core functionality, EmbedAnything offers significant speed advantages: Rust compiles to native code, has no garbage collector, and supports true multithreading without a global interpreter lock.
Running language models or embedding models locally can be difficult, especially when you want to deploy a product that uses them. If you use the transformers library from Hugging Face in Python, you depend on PyTorch for tensor operations, which in turn depends on Libtorch, meaning you need to ship the entire Libtorch library with your product. EmbedAnything instead uses Candle, Hugging Face's minimalist machine-learning framework for Rust, which removes this dependency. Candle also allows inference on CUDA-enabled GPUs right out of the box. We will soon post about how we use Candle to increase performance and reduce the memory usage of EmbedAnything.
Finally, let's see how EmbedAnything handles multimodality. When a directory is passed to EmbedAnything for embedding, each file's extension is checked to see whether it is text or an image, and a suitable embedding model is used to generate the embeddings. This makes it very easy to embed documents regardless of file type, be it .docx, .md, or .pdf, and images can be embedded directly as well. Future versions will also support embedding audio files.
Let's look at an example of how convenient EmbedAnything is to use: zero-shot classification of fashion images. Let's say we have a folder of fashion images.
We want the model to categorize them as one of ['Shirt', 'Coat', 'Jeans', 'Skirt', 'Hat', 'Shoes', 'Bag'].
To get started, you’ll need to install the embed-anything package:
pip install embed-anything
Next, import the necessary dependencies:
import embed_anything
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
With just two lines of code, you can obtain the embeddings for all the images in a directory using CLIP embeddings:
data = embed_anything.embed_directory("images", embeder="Clip")  # embed the "images" folder
embeddings = np.array([item.embedding for item in data])
Define the labels that you want the model to predict and embed the labels:
labels = ['Shirt', 'Coat', 'Jeans', 'Skirt', 'Hat', 'Shoes', 'Bag']
label_embeddings = embed_anything.embed_query(labels, embeder="Clip")
label_embeddings = np.array([label.embedding for label in label_embeddings])
fig, ax = plt.subplots(1, len(data), figsize=(20, 5))
for i in range(len(data)):
    # Compare this image's embedding with every label embedding
    similarities = np.dot(label_embeddings, data[i].embedding)
    max_index = np.argmax(similarities)
    image_path = data[i].text
    # Open and plot the image with the predicted label as its title
    img = Image.open(image_path)
    ax[i].imshow(img)
    ax[i].axis('off')
    ax[i].set_title(labels[max_index])
plt.show()
That's it. We simply compute the similarity between each image embedding and the label embeddings and assign the image the label with the highest similarity. We can also visualize the output.
With EmbedAnything, adding more images to the folder or more labels to the list is effortless. This method scales very well and does not require any training, making it a powerful tool for zero-shot classification tasks.
In this article, we learned about embedding models and how to use EmbedAnything to enhance your embedding pipeline, speeding up the generation process with just a few lines of code.
You can check out EmbedAnything here.
We are actively looking for contributors to build and extend the pipeline to make embeddings easier and more powerful.