In today’s world, where data comes in various forms, including text, images, and multimedia, there is a growing need for applications to understand and process this diverse information. One such application is a multimodal image search app, which allows users to search for images using natural language queries. In this blog post, we’ll explore how to build a multimodal image search app using Titan Embeddings from Amazon, FAISS (Facebook AI Similarity Search), and LangChain, an open-source library for building applications with large language models (LLMs).
Building such an app requires combining several cutting-edge technologies, including multimodal embeddings, vector databases, and natural language processing (NLP) tools. Following the steps outlined in this post, you’ll learn how to preprocess images, generate multimodal embeddings, index the embeddings using FAISS, and create a simple application that can take in natural language queries, search the indexed embeddings, and return the most relevant images.
Let's start by understanding some basic terminology.
Amazon Bedrock is a fully managed service that provides a wide range of features you need to create generative AI applications with security, privacy, and responsible AI. It provides a single API for selecting high-performing foundation models (FMs) from top AI vendors like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon.
With Amazon Bedrock, you can quickly test and assess the best FMs for your use case and privately customize them with your data using Retrieval Augmented Generation (RAG) and fine-tuning. You can also build agents that execute tasks using your enterprise systems and data sources. You don't need to manage any infrastructure because Amazon Bedrock is serverless, and you can securely integrate generative AI capabilities into your applications using the AWS services you are already familiar with.
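To give a feel for the single-API experience, here is a minimal sketch that lists the foundation models available to your account (assumes AWS credentials are configured and Amazon Bedrock is enabled in the chosen Region):

import boto3

# The "bedrock" client exposes control-plane operations such as listing models;
# the "bedrock-runtime" client used later in this post handles inference calls
bedrock = boto3.client("bedrock", region_name="us-east-1")
for model in bedrock.list_foundation_models()["modelSummaries"]:
    print(model["modelId"])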
Amazon Titan Embeddings converts natural language text—individual words, sentences, and even lengthy documents—into numerical representations that can be used to enhance use cases like personalization, search, and clustering based on semantic similarity. Optimized for text retrieval to support Retrieval Augmented Generation (RAG) use cases, Titan Embeddings lets you leverage your proprietary data in conjunction with other FMs: it first converts your text into numerical representations, or vectors, which you can then use to accurately retrieve relevant passages from a vector database.
Titan Embeddings supports more than 25 languages, including English, Chinese, and Spanish. Because you can input up to 8,192 tokens, it works with single words, sentences, or entire documents, depending on your use case. The model yields output vectors with 1,536 dimensions while remaining optimized for low latency and cost-effectiveness. You can use Titan Embeddings with a single API call, without managing any infrastructure, because it's available through Amazon Bedrock's serverless experience.
Amazon Titan Embeddings is available in all AWS Regions where Amazon Bedrock is offered, including US East (N. Virginia) and US West (Oregon).
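To see these numbers in practice, here is a hedged sketch that embeds one sentence with the Titan text model and checks the vector length (assumes AWS credentials are configured and model access is granted in your account):

import boto3
import json

# Embed a sentence with Titan Text Embeddings and confirm the 1,536-dimension
# output described above
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
response = bedrock_runtime.invoke_model(
    body=json.dumps({"inputText": "Vector search made simple."}),
    modelId="amazon.titan-embed-text-v1",
    accept="application/json",
    contentType="application/json",
)
embedding = json.loads(response["body"].read())["embedding"]
print(len(embedding))  # expect 1536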
Vector databases are specialized databases designed to store and retrieve high-dimensional data efficiently. This data is often represented as vectors, which are numerical arrays that capture the essential features or characteristics of the data point.
Vector databases are powerful tools for applications that demand efficient retrieval based on similarity. Their ability to handle high-dimensional data and find semantic connections makes them valuable assets in various fields where similar data points hold significant value.
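To make the idea concrete, here is a toy illustration in pure NumPy, with made-up 4-dimensional "embeddings", of how similarity search ranks stored vectors against a query:

import numpy as np

# Three made-up 4-dimensional "embeddings" standing in for stored data points
vectors = np.array([
    [0.9, 0.1, 0.0, 0.0],   # e.g., "dog"
    [0.8, 0.2, 0.1, 0.0],   # e.g., "puppy" (semantically close to "dog")
    [0.0, 0.1, 0.9, 0.3],   # e.g., "airplane"
])
query = np.array([0.85, 0.15, 0.05, 0.0])  # a query near "dog"

# Cosine similarity: dot products of L2-normalized vectors
normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
scores = normalized @ (query / np.linalg.norm(query))
print(scores.argsort()[::-1])  # indices ranked from most to least similar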
FAISS (Facebook AI Similarity Search) is a free, open-source library developed by Meta (formerly Facebook) for efficient similarity search in high-dimensional vector spaces. It is particularly well suited to large datasets containing millions or even billions of vectors.
So what does FAISS do? At its core, it builds indexes over dense vectors so that nearest-neighbor lookups stay fast even at very large scale, offering both exact and approximate search algorithms that can trade a little accuracy for large gains in speed and memory.
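Before wiring FAISS into LangChain, a quick taste of its core API helps; this minimal sketch (runnable once faiss-cpu, installed below, is available) indexes random vectors and retrieves nearest neighbors:

import faiss
import numpy as np

# Build an exact L2-distance index over 1,000 random 128-dimensional vectors
dimension = 128
index = faiss.IndexFlatL2(dimension)
index.add(np.random.random((1000, dimension)).astype("float32"))

# Retrieve the 5 nearest neighbors of a random query vector
query = np.random.random((1, dimension)).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0])

With the background covered, let's install the dependencies for the app: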
!pip install \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57" \
    "langchain==0.1.16" \
    "langchain-openai==0.1.3" \
    "langchain-community==0.0.33" \
    "langchain-aws==0.1.0" \
    "faiss-cpu"
Now let's import the required libraries.
import os
import boto3
import json
import base64
from langchain_community.vectorstores import FAISS
from io import BytesIO
from PIL import Image
The first step is generating embeddings. The get_multimodal_vector function accepts either text, a base64-encoded image, or both, and calls the Amazon Titan Multimodal Embeddings model through Amazon Bedrock's InvokeModel API to produce a joint embedding vector.
# This function is named get_multimodal_vector and takes two optional arguments
def get_multimodal_vector(input_image_base64=None, input_text=None):
    # Create a Boto3 session object to interact with AWS services
    session = boto3.Session()
    # Create a Bedrock runtime client to invoke models on the Bedrock service
    bedrock = session.client(service_name='bedrock-runtime')
    # Create an empty dictionary to hold the request data
    request_body = {}
    # If input_text is provided, add it to the request body under "inputText"
    if input_text:
        request_body["inputText"] = input_text
    # If input_image_base64 is provided, add it under "inputImage"
    if input_image_base64:
        request_body["inputImage"] = input_image_base64
    # Serialize the request body dictionary into a JSON string
    body = json.dumps(request_body)
    # Invoke the Titan Multimodal Embeddings model with the prepared request
    response = bedrock.invoke_model(
        body=body,
        modelId="amazon.titan-embed-image-v1",
        accept="application/json",
        contentType="application/json"
    )
    # Decode the JSON response body returned by Bedrock
    response_body = json.loads(response.get('body').read())
    # Extract the "embedding" value: the joint multimodal vector
    embedding = response_body.get("embedding")
    # Return the extracted embedding vector
    return embedding
This function serves as a bridge between your Python application and the Bedrock service. It allows you to send image or text data and retrieve a multimodal vector. This potentially enables applications like image/text search, recommendation systems, or tasks requiring capturing the essence of different data types in a unified format.
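As a quick sanity check, you could call the function directly (the sample text here is illustrative; Titan Multimodal Embeddings typically returns 1,024-dimensional vectors by default):

# Illustrative usage: embed a short text query and inspect the vector length
text_vector = get_multimodal_vector(input_text="a dog playing in a park")
print(len(text_vector))  # typically 1024 for Titan Multimodal Embeddings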
The get_vector_from_file function takes an image file path, encodes the image to base64, generates an embedding vector using Titan Multimodal Embeddings, and returns the vector, allowing images to be represented as vectors.
# This function takes a file path as input and returns a vector representation of the content
def get_vector_from_file(file_path):
    # Open the file in binary reading mode ("rb")
    with open(file_path, "rb") as image_file:
        # Read the entire file content as bytes
        file_content = image_file.read()
    # Encode the binary file content into a base64 string
    input_image_base64 = base64.b64encode(file_content).decode('utf8')
    # Generate a vector from the base64-encoded image
    vector = get_multimodal_vector(input_image_base64=input_image_base64)
    # Return the generated vector
    return vector
This function acts as a wrapper for get_multimodal_vector. It takes a file path, reads the file content, converts it to a format suitable for get_multimodal_vector (base64 encoded string), and ultimately returns the generated vector representation.
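For example (the path below is hypothetical):

# Illustrative usage; the path is hypothetical
vector = get_vector_from_file("animals/dog/001.jpg")
print(f"Embedding length: {len(vector)}")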
Next, we gather image vectors from a directory.
def get_image_vectors_from_directory(path_name):
    """
    Extracts image paths and their corresponding vectors from a directory
    and its immediate subdirectories.
    Args:
        path_name (str): The path to the directory containing images.
    Returns:
        list: A list of tuples, each containing an image path and its vector representation.
    """
    items = []  # List to store tuples of (image_path, vector)
    # Loop through each entry in the given directory
    for n in os.listdir(path_name):
        # Construct the full path for the entry
        file_path = os.path.join(path_name, n)
        # Check if the entry is a JPG image
        if n.endswith('.jpg'):
            # Resize the image in place if it exceeds the size limit
            check_size_image(file_path)
            # Get the vector representation of the image
            vector = get_vector_from_file(file_path)
            # Append a tuple containing the image path and vector
            items.append((file_path, vector))
        # If the entry is a subdirectory, look for JPGs one level down
        elif os.path.isdir(file_path):
            for n_2 in os.listdir(file_path):
                if n_2.endswith('.jpg'):
                    # Construct the full path for the image within the subdirectory
                    sub_file_path = os.path.join(file_path, n_2)
                    # Resize if necessary, then embed the image
                    check_size_image(sub_file_path)
                    vector = get_vector_from_file(sub_file_path)
                    items.append((sub_file_path, vector))
                else:
                    # Report files in the subdirectory that are not JPGs
                    print(f"Not a JPG file: {n_2}")
        else:
            # Report top-level files that are neither JPGs nor directories
            print(f"Not a JPG file: {n}")
    # Return the list of tuples containing image paths and their vectors
    return items
This function takes a directory path (path_name) and returns a list of tuples, each containing the path to a JPG image and its corresponding vector representation.
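If your images are nested more than one level deep, a recursive variant is a natural alternative; the sketch below (an assumption-laden rewrite, not the article's original helper) is equivalent in spirit but descends to any depth using os.walk:

# An equivalent sketch using os.walk, which descends to any depth and skips
# non-JPG files without assuming a specific folder layout
def get_image_vectors_from_directory_recursive(path_name):
    items = []
    for root, _dirs, files in os.walk(path_name):
        for name in files:
            if name.lower().endswith(".jpg"):
                file_path = os.path.join(root, name)
                check_size_image(file_path)  # resize in place if oversized
                items.append((file_path, get_vector_from_file(file_path)))
            else:
                print(f"Not a JPG file: {name}")
    return items

The next helper enforces the image-size limit before embedding: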
def check_size_image(file_path):
    """
    Checks whether an image exceeds a predefined maximum size and
    resizes it in place if necessary.
    Args:
        file_path (str): The path to the image file.
    Returns:
        None
    """
    # Maximum allowed dimension in pixels (replace with your desired limit)
    max_size = 2048
    # Open the image using the Pillow library
    try:
        image = Image.open(file_path)
    except FileNotFoundError:
        print(f"Error: File not found - {file_path}")
        return
    # Get the image width and height in pixels
    width, height = image.size
    # Check if either width or height exceeds the maximum size
    if width > max_size or height > max_size:
        print(f"Image '{file_path}' exceeds maximum size: width: {width}, height: {height} px")
        # Calculate how far each dimension exceeds the limit
        dif_width = width - max_size
        dif_height = height - max_size
        # Scale based on whichever dimension overshoots the most, so that
        # both dimensions end up within the limit
        if dif_width > dif_height:
            scale_factor = 1 - (dif_width / width)
        else:
            scale_factor = 1 - (dif_height / height)
        # Calculate the new width and height from the scaling factor
        new_width = int(width * scale_factor)
        new_height = int(height * scale_factor)
        print(f"Resized image dimensions: width: {new_width}, height: {new_height} px")
        # Resize the image and overwrite the original file (note: this
        # modifies your source images on disk)
        new_image = image.resize((new_width, new_height))
        new_image.save(file_path)
    # If no resizing is needed, the image file is left unmodified
This function checks if an image exceeds a predefined maximum size and resizes it if necessary.
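As an aside, Pillow's built-in thumbnail() method achieves the same goal more compactly; this alternative sketch (not the article's helper) shrinks an image in place while preserving its aspect ratio:

# Alternative sketch using Pillow's thumbnail(), which resizes an image in
# place to fit within max_size on both sides while preserving aspect ratio
def check_size_image_thumbnail(file_path, max_size=2048):
    image = Image.open(file_path)
    if max(image.size) > max_size:
        image.thumbnail((max_size, max_size))  # resizes in place
        image.save(file_path)                  # overwrites the original file

With the helpers in place, we can assemble the vector database: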
def create_vector_db(path_name):
    """
    Creates a vector database from image files in a directory.
    Args:
        path_name (str): The path to the directory containing images.
    Returns:
        FAISS index object: The created vector database using FAISS.
    """
    # Get a list of (image_path, vector) tuples from the directory
    image_vectors = get_image_vectors_from_directory(path_name)
    # Pair each precomputed vector with an empty text string, the format
    # FAISS.from_embeddings expects, and keep each image path as metadata
    text_embeddings = [("", item[1]) for item in image_vectors]
    metadatas = [{"image_path": item[0]} for item in image_vectors]
    # Build the FAISS index from the precomputed embeddings; no embedding
    # function is passed because the vectors already exist
    db = FAISS.from_embeddings(
        text_embeddings=text_embeddings,
        embedding=None,
        metadatas=metadatas
    )
    # Print the number of documents stored in the database
    print(f"Vector Database: {db.index.ntotal} docs")
    # Return the created FAISS vector store
    return db
# Unzip the archive named "animals.zip" (assumed to be in the current directory)
!unzip animals.zip
# Define the path to the extracted animal images (adjust if needed)
path_name = "./animals"
# Create the vector database from the extracted animal images
db = create_vector_db(path_name)
The next step is to save the vector database locally.
# Define the filename for the vector database
db_file = "animals.vdb"
# Save the created vector database (FAISS index object) to a local file
db.save_local(db_file)
# Print a confirmation message indicating the filename where the database is saved
print(f"Vector database was saved in {db_file}")
# Define the query text to search for
query = "dog"
# Get a multimodal vector representation of the query text using get_multimodal_vector
search_vector = get_multimodal_vector(input_text=query)
# Perform a similarity search in the vector database using the query vector
results = db.similarity_search_by_vector(embedding=search_vector)
# Iterate over the returned search results
for res in results:
    # Extract the image path from the result metadata
    image_path = res.metadata['image_path']
    # Open the image file in binary reading mode
    with open(image_path, "rb") as f:
        # Read the image content as bytes
        image_data = f.read()
    # Wrap the image bytes in an in-memory BytesIO buffer
    img = BytesIO(image_data)
    # Open the image from the buffer using Pillow
    image = Image.open(img)
    # Display the retrieved image in the default viewer
    image.show()
Output: for the query "dog", the matching dog images from the indexed dataset open in the image viewer.
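To make querying reusable, you could wrap the embed, search, and collect steps in a small helper; search_images below is a hypothetical convenience function, not part of the original code:

# A hypothetical convenience wrapper: embed the query text, search the
# index, and return the top-k matching image paths
def search_images(db, query_text, k=3):
    query_vector = get_multimodal_vector(input_text=query_text)
    results = db.similarity_search_by_vector(embedding=query_vector, k=k)
    return [res.metadata["image_path"] for res in results]

print(search_images(db, "a cat sleeping", k=3))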
This article taught us how to build a multimodal smart image search tool using Titan Embeddings, FAISS, and LangChain. This tool lets users find images using everyday language, making image searches easier and more intuitive. We covered everything step by step, from preparing images to creating search functions. Developers can use AWS Bedrock, Boto3, and free software to make strong, scalable tools that handle different kinds of data. Now, developers can create smart search tools, combining data types to improve search results and user experiences.
Frequently Asked Questions

Q. Can this approach handle multimodal data other than images and text?
A. While this article focuses on images and text, similar approaches can be adapted for other types of multimodal data, such as audio and text. The key is to leverage appropriate models and techniques for each data modality and ensure compatibility with the chosen vector database and search algorithms.
Q. How can I improve the performance of a multimodal search system?
A. Performance tuning can involve various strategies, including optimizing model parameters, fine-tuning embeddings, adjusting search algorithms and parameters, and optimizing infrastructure resources. Experimentation and iterative refinement are key to achieving optimal performance.
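For example, one concrete tuning lever in FAISS is swapping the exact IndexFlatL2 (used by default in the index built above) for an approximate IVF index; a hedged sketch with synthetic data:

import faiss
import numpy as np

# An IVF index partitions vectors into clusters and searches only a few per
# query, trading a little recall for much lower latency than exact search
dimension, n_clusters = 1024, 100
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, n_clusters)

vectors = np.random.random((10000, dimension)).astype("float32")
index.train(vectors)  # IVF indexes must be trained before vectors are added
index.add(vectors)
index.nprobe = 10     # clusters visited per query: higher = better recall, slower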
Q. What privacy and security considerations apply when building such an app?
A. When using cloud-based AI services, it’s essential to consider privacy and security implications, especially when dealing with sensitive data. Ensure compliance with relevant regulations, implement appropriate access controls and encryption mechanisms, and regularly audit and monitor the system for security vulnerabilities.
Q. Is this architecture suitable for production deployment?
A. Yes, the architecture presented in this article is suitable for deployment in production environments. However, before production deployment, ensure proper scalability, reliability, performance testing, and compliance with relevant operational best practices and security standards.
Q. Are there alternatives to Amazon Bedrock for hosting the models?
A. Yes, several alternative cloud platforms and services offer similar capabilities for AI model hosting, such as Google Cloud AI Platform, Microsoft Azure Machine Learning, and IBM Watson. Evaluate each platform’s features, pricing, and ecosystem support to determine the best fit for your requirements.