Have you ever wondered how our intricate brains process the world? While the brain’s inner workings remain a mystery, we can liken it to a versatile neural network. Thanks to electrochemical signals, it handles various data types – audio, visuals, smells, tastes, and touch. As AI advances, multi-modal models emerge, revolutionizing search capabilities. This innovation opens up possibilities, enhancing search accuracy and relevance. Discover the fascinating realm of multi-modal search.
If you google it, you will find that multi-modal refers to involving multiple modes or methods in a process. In Artificial Intelligence, multi-modal models are neural networks that can process and understand different data types. For example, GPT-4 and Bard are LLMs that can understand both text and images. Other examples include Tesla's self-driving cars, which combine visual and sensory data to make sense of their surroundings, and Midjourney or DALL-E, which generate pictures from text descriptions.
CLIP is an open-source multi-modal neural network from OpenAI trained on a large dataset of image-text pairs. This ensures CLIP learns to associate visual concepts in images with their text descriptions. The CLIP model can be instructed in human language to classify a wide range of image data without specific training.
The zero-shot capability of CLIP is comparable to that of GPT-3. Therefore, CLIP can be used to classify images into any set of categories without having to be trained on those categories specifically. For example, to classify images of dogs vs. cats, we only need to compare the logit scores of the image against the text descriptions “an image of a dog” and “an image of a cat”; a photo of a cat or a dog will have a higher logit score with its respective text description.
This is known as zero-shot classification because CLIP does not need to be trained on a dataset of images of dogs and cats to be able to classify them. Here’s a visual presentation of how CLIP works.
CLIP uses a Vision Transformer (ViT) to encode images and a text model to encode text. The vector encodings are then projected into a shared vector space with identical dimensions. The dot product between the two is used as a similarity score to predict how well the text snippet matches the image. In other words, CLIP can classify images into any set of categories without being optimized for them. In this article, we will programmatically implement CLIP.
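To make the zero-shot idea concrete, here is a minimal sketch using OpenAI's clip package. The image path "pet.jpg" is just a placeholder; substitute any image of your own.

import clip
import torch
from PIL import Image

# A minimal zero-shot classification sketch; "pet.jpg" is a placeholder image path.
device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("pet.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["an image of a dog", "an image of a cat"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # the matching description gets the higher probability

The image and the candidate descriptions are encoded separately, and the logit scores (scaled dot products) decide the label, with no task-specific training involved.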
Machine learning algorithms do not understand data in its raw format, so we need to transform it into numerical form. Vectors, or embeddings, are the numerical representations of various data types such as text, images, audio, and video. However, traditional databases are not built to query high-dimensional vector data. To build an application that uses millions of vector embeddings, we need vector databases, which are purpose-built to store, search, and query embeddings.
The following picture illustrates a simplified workflow of a vector database.
We need specialized embedding models capable of capturing the underlying semantic meaning of the data, and the models differ by data type. For image data, use image models such as ResNet or Vision Transformers. For text, use text models such as Ada or SentenceTransformers. For cross-modal interaction, use multimodal models such as Tortoise (text-to-speech) and CLIP (text-to-image). These models produce the embeddings of the input data. Vector databases usually ship with default embedding models, but we can also define our own models to get embeddings and store them in vector stores.
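For instance, here is a minimal sketch of getting text embeddings with SentenceTransformers (assuming the sentence-transformers package is installed; "all-MiniLM-L6-v2" is simply one commonly used model):

# A minimal text-embedding sketch; "all-MiniLM-L6-v2" is one commonly used model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["a bowl of ramen", "a chocolate cake"])
print(embeddings.shape)  # (2, 384) -- each text becomes a 384-dimensional vector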
Embeddings are usually high-dimensional, and querying high-dimensional vectors is often time and compute-intensive. Hence, vector databases employ various indexing methods for efficient querying. Indexing refers to organizing high-dimensional vectors in a way that provides efficient querying of nearest-neighbor vectors.
Some popular indexing algorithms are HNSW (Hierarchical Navigable Small World), Product Quantization (PQ), Inverted File Index (IVF), Scalar Quantization, etc. Of these, HNSW is the most popular and widely used algorithm across different vector databases.
For this application, we will use the Chroma Vector Database. Chroma is an open-source vector database. It lets you quickly set up a client to store and query vectors and associated metadata. There are other such vector stores that you can use, such as Weaviate, Qdrant, Milvus, etc.
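As a quick illustration, here is a minimal Chroma sketch; the collection name and documents are purely illustrative. By default, Chroma embeds text documents with its built-in embedding model and indexes them with HNSW, whose distance metric can be configured via collection metadata.

# A minimal Chroma sketch; the collection name and documents are illustrative.
import chromadb

client = chromadb.Client()  # in-memory client
collection = client.create_collection(
    name="demo",
    metadata={"hnsw:space": "cosine"},  # HNSW distance metric (defaults to "l2")
)
collection.add(
    ids=["1", "2"],
    documents=["a plate of pasta", "a bowl of ramen"],
)
print(collection.query(query_texts=["noodle soup"], n_results=1))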
Gradio is an open-source Python library for quickly building web interfaces to share Machine Learning models. It lets us set up a demo web interface entirely in Python and gives us enough flexibility to create a decent prototype to showcase the backend models.
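For example, a minimal Gradio demo looks something like this (a sketch; the greet function is just an illustration):

# A minimal Gradio sketch: a text box in, a text box out.
import gradio as gr

def greet(name: str) -> str:
    return f"Hello, {name}!"

gr.Interface(fn=greet, inputs="text", outputs="text").launch()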
This section will walk through the code to create a simple restaurant dish recommender app using Gradio, Chroma, and CLIP. Chroma doesn’t yet have out-of-the-box support for multi-modal models, so this will be a workaround.
There are two ways to use CLIP in your project: OpenAI’s reference implementation or Hugging Face’s implementation. For this project, we will use OpenAI’s CLIP. Note that OpenAI’s clip package is installed from its GitHub repository (pip install git+https://github.com/openai/CLIP.git) rather than from PyPI. Make sure you have a virtual environment with the following dependencies installed.
clip
torch
chromadb
gradio
This is our directory structure.
├── app.py
├── clip_chroma
├── clip_embeddings.py
├── __init__.py
├── load_data.py
The first thing we need to do is build a class to extract embeddings of images and texts. As we know, CLIP has two parts to process texts and images. We will use respective models to encode different modalities.
import clip
import torch
from numpy import ndarray
from typing import List
from PIL import Image


class ClipEmbeddingsfunction:
    def __init__(self, model_name: str = "ViT-B/32", device: str = "cpu"):
        self.device = device  # Store the specified device for model execution
        self.model, self.preprocess = clip.load(model_name, self.device)

    def __call__(self, docs: List[str]) -> List[ndarray]:
        # Define a method that takes a list of image file paths (docs) as input
        list_of_embeddings = []  # Create an empty list to store the image embeddings
        for image_path in docs:
            image = Image.open(image_path)  # Open and load an image from the provided path
            image = image.resize((224, 224))
            # Preprocess the image and move it to the specified device
            image_input = self.preprocess(image).unsqueeze(0).to(self.device)
            with torch.no_grad():
                # Compute the image embeddings using the CLIP model and convert
                # them to NumPy arrays
                embeddings = self.model.encode_image(image_input).cpu().detach().numpy()
            list_of_embeddings.append(list(embeddings[0]))
        return list_of_embeddings

    def get_text_embeddings(self, text: str) -> List[ndarray]:
        # Define a method that takes a text string as input
        text_token = clip.tokenize(text)  # Tokenize the input text
        with torch.no_grad():
            # Compute the text embeddings using the CLIP model and convert them to NumPy arrays
            text_embeddings = self.model.encode_text(text_token).cpu().detach().numpy()
        return list(text_embeddings[0])
In the above code, we have defined a class to extract embeddings of texts and images. The class takes the model name and device as inputs. If your device supports CUDA, you can enable it by passing device="cuda". CLIP supports several models, such as
clip.available_models()
['RN50',
'RN101',
'RN50x4',
'RN50x16',
'RN50x64',
'ViT-B/32',
'ViT-B/16',
'ViT-L/14',
'ViT-L/14@336px']
The model name by default is set as “ViT-B/32”. You can pass any other model you wish.
The __call__ method takes a list of image paths and returns a list of numpy arrays. The get_text_embeddings method takes a string input and returns a list of embeddings.
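As a quick sanity check, the class can be used like this; the image file names are placeholders for your own files, and the 512-dimension output applies to the ViT-B/32 model.

# Hypothetical usage; "dish1.jpg" and "dish2.jpg" are placeholder image files.
ef = ClipEmbeddingsfunction(model_name="ViT-B/32", device="cpu")

image_embeddings = ef(["dish1.jpg", "dish2.jpg"])
text_embedding = ef.get_text_embeddings("a spicy noodle soup")

print(len(image_embeddings), len(image_embeddings[0]))  # 2 images, 512-dim vectors for ViT-B/32
print(len(text_embedding))                               # 512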
We need to populate our vector database first. I collected a few images of dishes to add to our collection. Create a list of image paths and a list of descriptions for them. The image paths will be our documents, while the image descriptions will be stored as metadata.
But first, create a Chroma collection.
import os
from chromadb import Client, Settings
from clip_embeddings import ClipEmbeddingsfunction
from typing import List
ef = ClipEmbeddingsfunction()
client = Client(settings = Settings(is_persistent=True, persist_directory="./clip_chroma"))
coll = client.get_or_create_collection(name = "clip", embedding_function = ef)
We imported the embedding function we defined earlier and passed it as the default embedding function for the collection.
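The add call below expects two lists, img_list and menu_description, which you build from your own images. A hypothetical example (the paths and descriptions are illustrative; any single-key metadata dict works with the retrieval code later on):

# Hypothetical data; replace the paths and descriptions with your own dishes.
img_list = [
    "images/butter_chicken.jpg",
    "images/margherita_pizza.jpg",
    "images/ramen.jpg",
]
menu_description = [
    {"description": "Butter chicken served with garlic naan"},
    {"description": "Classic margherita pizza with fresh basil"},
    {"description": "Tonkotsu ramen with a soft-boiled egg"},
]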
Now, load the data into the database.
coll.add(
    ids=[str(i) for i in range(len(img_list))],
    documents=img_list,  # paths to images
    metadatas=menu_description,  # descriptions of dishes
)
That’s it. Now, you are ready to build the final part.
First, create an app.py file, import the following dependencies, and initiate the embedding function.
import gradio as gr
from chromadb import Client, Settings
from clip_embeddings import ClipEmbeddingsfunction
client = Client(Settings(is_persistent=True, persist_directory="./clip_chroma"))
ef = ClipEmbeddingsfunction()
As the front end, we will use Gradio Blocks to build a simple interface that takes a search query, either text or an image, and shows relevant images as output.
with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            query = gr.Textbox(placeholder="Enter query")
            gr.HTML("OR")
            photo = gr.Image()
            button = gr.UploadButton(label="Upload file", file_types=["image"])
        with gr.Column():
            gallery = gr.Gallery().style(
                object_fit='contain',
                height='auto',
                preview=True
            )
Now, we will define trigger events for the gradio app.
    # These event handlers are registered inside the `with gr.Blocks() as demo:` context
    query.submit(
        fn=retrieve_image_from_query,
        inputs=[query],
        outputs=[gallery]
    )

    button.upload(
        fn=show_img,
        inputs=[button],
        outputs=[photo]).\
        then(
        fn=retrieve_image_from_image,
        inputs=[button],
        outputs=[gallery]
    )
In the above code, we defined two trigger events. Submitting a text query calls the retrieve_image_from_query function. Uploading an image first renders it on the photo component via show_img and then invokes retrieve_image_from_image(), displaying the output on the Gallery object.
Run the app.py file with the gradio command and visit the local address shown in the terminal.
Now, we will define the actual functions.
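Let's start with the simplest one. The upload trigger references show_img, a small helper that simply returns the uploaded file's path so Gradio's Image component can render it (it also appears in the full listing below):

def show_img(image):
    # Gradio's UploadButton passes a temp file object; its .name attribute is the file path.
    return image.name

Next comes retrieve_image_from_image, which handles image-to-image search.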
def retrieve_image_from_image(image):
    # Get the collection named "clip" using the specified embedding function (ef)
    coll = client.get_collection(name="clip", embedding_function=ef)

    # Extract the path of the uploaded image file
    image = image.name

    # Query the collection using the image file path as the query text
    result = coll.query(
        query_texts=image,  # The custom embedding function opens and embeds this image
        include=["documents", "metadatas"],  # Include both documents and metadata in the results
        n_results=4  # Specify the number of results to retrieve
    )

    # Get the retrieved documents and their metadata
    docs = result['documents'][0]
    descs = result["metadatas"][0]

    # Create a list to store pairs of documents and their corresponding metadata
    list_of_docs = []

    # Iterate through the retrieved documents and metadata
    for doc, desc in zip(docs, descs):
        # Append a tuple containing the document and its metadata to the list
        list_of_docs.append((doc, list(desc.values())[0]))

    # Return the list of document-metadata pairs
    return list_of_docs
Note that passing the uploaded file’s path as query_texts works only because our custom embedding function treats every input as an image path, opens the file, and embeds it with CLIP. We also have another function to handle text queries.
def retrieve_image_from_query(query: str):
    # Get the collection named "clip" using the specified embedding function (ef)
    coll = client.get_collection(name="clip", embedding_function=ef)

    # Get text embeddings for the input query using the embedding function (ef)
    emb = ef.get_text_embeddings(text=query)

    # Convert the text embeddings to float values
    emb = [float(i) for i in emb]

    # Query the collection using the text embeddings
    result = coll.query(
        query_embeddings=emb,  # Use the text embeddings as the query
        include=["documents", "metadatas"],  # Include both documents and metadata in the results
        n_results=4  # Specify the number of results to retrieve
    )

    # Get the retrieved documents and their metadata
    docs = result['documents'][0]
    descs = result["metadatas"][0]

    # Create a list to store pairs of documents and their corresponding metadata
    list_of_docs = []

    # Iterate through the retrieved documents and metadata
    for doc, desc in zip(docs, descs):
        # Append a tuple containing the document and its metadata to the list
        list_of_docs.append((doc, list(desc.values())[0]))

    # Return the list of document-metadata pairs
    return list_of_docs
Instead of passing the text directly to the query, we extracted the embeddings ourselves and then passed them to Chroma’s query method.
So, here’s the complete code for app.py.
# Import the necessary libraries
import gradio as gr
from chromadb import Client, Settings
from clip_embeddings import ClipEmbeddingsfunction

# Initialize a chromadb client with persistent storage
client = Client(Settings(is_persistent=True, persist_directory="./clip_chroma"))

# Initialize the ClipEmbeddingsfunction
ef = ClipEmbeddingsfunction()

# Function to retrieve images from a text query
def retrieve_image_from_query(query: str):
    # Get the "clip" collection with the specified embedding function
    coll = client.get_collection(name="clip", embedding_function=ef)
    # Get the text embeddings for the input query
    emb = ef.get_text_embeddings(text=query)
    emb = [float(i) for i in emb]
    # Query the collection for similar documents
    result = coll.query(
        query_embeddings=emb,
        include=["documents", "metadatas"],
        n_results=4
    )
    # Extract documents and their metadata
    docs = result['documents'][0]
    descs = result["metadatas"][0]
    list_of_docs = []
    # Combine documents and descriptions into a list
    for doc, desc in zip(docs, descs):
        list_of_docs.append((doc, list(desc.values())[0]))
    return list_of_docs

# Function to retrieve images from an uploaded image
def retrieve_image_from_image(image):
    # Get the "clip" collection with the specified embedding function
    coll = client.get_collection(name="clip", embedding_function=ef)
    # Get the file path of the uploaded image
    image = image.name
    # Query the collection with the image file path
    result = coll.query(
        query_texts=image,
        include=["documents", "metadatas"],
        n_results=4
    )
    # Extract documents and their metadata
    docs = result['documents'][0]
    descs = result["metadatas"][0]
    list_of_docs = []
    # Combine documents and descriptions into a list
    for doc, desc in zip(docs, descs):
        list_of_docs.append((doc, list(desc.values())[0]))
    return list_of_docs

# Function to display an uploaded image
def show_img(image):
    return image.name

# Create the interface using Blocks
with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            # Text input for query
            query = gr.Textbox(placeholder="Enter query")
            gr.HTML("OR")
            # Image input through file upload
            photo = gr.Image()
            button = gr.UploadButton(label="Upload file", file_types=["image"])
        with gr.Column():
            # Display a gallery of images
            gallery = gr.Gallery().style(
                object_fit='contain',
                height='auto',
                preview=True
            )

    # Define the input and output for the query submission
    query.submit(
        fn=retrieve_image_from_query,
        inputs=[query],
        outputs=[gallery]
    )

    # Define the input and output for image upload
    button.upload(
        fn=show_img,
        inputs=[button],
        outputs=[photo]).\
        then(
        fn=retrieve_image_from_image,
        inputs=[button],
        outputs=[gallery]
    )

# Launch the Gradio interface if the script is run as the main program
if __name__ == "__main__":
    demo.launch()
Now, launch the app by running gradio app.py (or python app.py) in the terminal and visit the local address shown.
GitHub Repository: https://github.com/sunilkumardash9/multi-modal-search-app
Multi-modal search can have many uses across industries, such as e-commerce product discovery, media asset management, and content recommendation.
Multi-modal search will be game-changing in the future. Being able to interact in multiple modalities opens up new avenues of growth. So, this article was about using the Chroma vector database and a multi-modal CLIP model to build a basic search app. As the Chroma database does not have out-of-the-box support for multi-modal models, we created a custom CLIP embedding class to get embeddings from images and pieced together different parts to build the food search app.
Q1. What is multimodal search?
A. Multimodal search is a new approach to search that combines information from multiple modalities, such as text, images, audio, and video, to improve the accuracy and relevance of search results.

Q2. What is multimodal AI?
A. Multimodal AI refers to Machine Learning models that can process and understand various modalities of data, such as image, text, audio, etc.

Q3. What modalities can multimodal models work with?
A. Multimodal models typically work with four modes of communication: text, image, video, and audio.

Q4. What is approximate nearest neighbor (ANN) search?
A. Approximate nearest neighbor (ANN) is a searching algorithm. It intends to find the “n” closest data points to a given point in a vector space.

Q5. Why do LLMs need vector databases?
A. LLM-based applications need vector databases to efficiently store and retrieve high-dimensional vector representations of words and phrases, and to perform operations such as similarity matching over them.