Vision-language models are models that can process and understand both visual and language (textual) data simultaneously. They combine techniques from Computer Vision and Natural Language Processing to understand images and generate text based on the image content and the language instruction.
Many large vision-language models are available, such as OpenAI's GPT-4V, Salesforce's BLIP-2, MiniGPT-4, and LLaVA, which perform various image-to-text generation tasks like image captioning, visual question answering, visual reasoning, and text recognition. But like other large language models, they require heavy computational resources and exhibit slower inference speed and lower throughput.
On the other hand, Small Language Models (SLMs) use less memory and processing power, which makes them ideal for devices with limited resources. They are generally trained on smaller, more specialized datasets. In this article, we will explore Moondream2 (a small vision-language model), its components, capabilities, and limitations.
Moondream2 is an open-source tiny vision-language model that can easily run on devices in low-resource settings. Essentially, it's a 1.86-billion-parameter model initialized with weights from SigLIP and Phi-1.5. It is good at answering questions about images, generating captions for them, and performing various other vision-language tasks.
Moondream2 has two major components:
The SigLIP (Sigmoid Loss for Language-Image Pre-training) model is similar to the CLIP (Contrastive Language-Image Pre-training) model, but it replaces the softmax loss used in CLIP with a simple pairwise sigmoid loss. This modification leads to better performance on zero-shot classification and image-text retrieval tasks. Unlike the softmax loss, the sigmoid loss operates solely on individual image-text pairs, eliminating the need for a global view of pairwise similarities across all pairs within a batch. This enables scaling up batch sizes while also improving performance even with smaller batch sizes.
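To make the difference from CLIP's softmax loss concrete, here is a minimal PyTorch sketch of a pairwise sigmoid loss. This is an illustrative approximation, not SigLIP's actual implementation: the temperature t and bias b are fixed placeholder values here, whereas SigLIP learns them during training.

import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(image_emb, text_emb, t=10.0, b=-10.0):
    # Normalize embeddings so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise logits for every image-text combination in the batch
    logits = image_emb @ text_emb.t() * t + b
    # Labels: +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # Each pair contributes an independent binary (sigmoid) term,
    # so no batch-wide softmax normalization is needed
    return -F.logsigmoid(labels * logits).mean()

# Example with random embeddings standing in for image/text features
loss = sigmoid_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))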
Phi-1.5 is a transformer-based small language model with 1.3 billion parameters. It was introduced by Microsoft researchers in the paper “Textbooks Are All You Need II: phi-1.5 technical report” and is the successor of Phi-1. The model demonstrates remarkable performance across various benchmarks, including common-sense reasoning, multi-step reasoning, language comprehension, and knowledge understanding, outperforming models up to five times its size. Phi-1.5 was trained on a dataset of 30 billion tokens, which included 7 billion tokens from Phi-1's training data along with approximately 20 billion tokens generated synthetically with GPT-3.5.
Let us now see the Python implementation of moondream2 using transformers.
Prerequisites
We need to install transformers, timm (PyTorch Image Models), and einops (Einstein Operations) before using the model.
pip install transformers timm einops
Now let's load the tokenizer and model using transformers' AutoTokenizer and AutoModelForCausalLM classes respectively. Since the model undergoes regular updates, it's recommended to pin the model to a specific release by passing a revision, as shown below.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
revision = "2024-03-13"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
Note: To run the model on a GPU, enable Flash Attention on the text model by passing attn_implementation="flash_attention_2" when instantiating the model.
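If a CUDA GPU is available, the loading step above can be adapted as in the sketch below. This assumes a CUDA-capable device and that the flash-attn package is installed; it is one reasonable way to do it, not the only one.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    revision=revision,
    torch_dtype=torch.float16,                # half precision to reduce GPU memory
    attn_implementation="flash_attention_2",  # Flash Attention for the text model
).to("cuda")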
Now let’s test the model for various vision-language tasks.
Image captioning, as the name suggests, is the task of describing the content of an image in words. Let's see an example.
from PIL import Image
image = Image.open('busy street.jpg')
image
enc_image = model.encode_image(image)
output = model.answer_question(enc_image, "Describe this image in detail", tokenizer)
class color:
    BLUE = '\033[94m'
    BOLD = '\033[1m'
    END = '\033[0m'
print(color.BOLD+color.BLUE+"Input:"+color.END, "Describe this image in detail")
print(color.BOLD+color.BLUE+"Response:"+color.END, output)
Output:
The model generates a detailed description of the image by identifying the objects (such as the clock tower, buildings, buses, and people) and their activities.
Moondream2 can also generate personalized image-to-text descriptions, as shown in the example below.
image = Image.open('cat and dog.jpg')
image
enc_image = model.encode_image(image)
output = model.answer_question(enc_image, "Write a conversation between the two", tokenizer)
print(color.BOLD+color.BLUE+"Input:"+color.END, "Write a conversation between the two")
print(color.BOLD+color.BLUE+"Response:"+color.END, output)
Output:
VQA (Visual Question Answering) is about answering open-ended questions about an image. We pass in the image and the question as input to the model.
image = Image.open('girl and cats.jpg')
image
enc_image = model.encode_image(image)
answer1 = model.answer_question(enc_image, "How many cats is the girl holding?", tokenizer)
answer2 = model.answer_question(enc_image, "What is their color?", tokenizer)
print(color.BOLD+color.BLUE+"Question 1:"+color.END, "How many cats is the girl holding?")
print(color.BOLD+color.BLUE+"Answer 1:"+color.END, answer1)
print(color.BOLD+color.BLUE+"Question 2:"+color.END, "What is their color?")
print(color.BOLD+color.BLUE+"Answer 2:"+color.END, answer2)
Output:
The model correctly answers the above two questions regarding the image.
Visual storytelling is the task of telling a story or writing a poem based on an image. For example:
image = Image.open('beach sunset.jpg')
image
enc_image = model.encode_image(image)
output = model.answer_question(enc_image, "Write a beautiful poem about this image", tokenizer)
print(color.BOLD+color.BLUE+"Input:"+color.END, "Write a beautiful poem about this image")
print(color.BOLD+color.BLUE+"Response:"+color.END, output)
Output:
The model writes a beautiful poem based on the contents of the input image.
Visual knowledge reasoning involves integrating external knowledge and facts, extending beyond the visible content, to address questions effectively.
image = Image.open('the great wall of China.jpg')
image
enc_image = model.encode_image(image)
output = model.answer_question(enc_image, "Tell me about the history of this place", tokenizer)
print(color.BOLD+color.BLUE+"Input:"+color.END, "Tell me about the history of this place")
print(color.BOLD+color.BLUE+"Response:"+color.END, output)
Output:
The model identifies the image as the Great Wall of China and tells its history.
Visual commonsense reasoning involves answering questions by leveraging common knowledge and a contextual understanding of the visual world evoked by the image. For example:
image = Image.open('man and dog.jpg')
image
enc_image = model.encode_image(image)
output = model.answer_question(enc_image, "what does the man feel and why?", tokenizer)
print(color.BOLD+color.BLUE+"Input:"+color.END, "what does the man feel and why?")
print(color.BOLD+color.BLUE+"Response:"+color.END, output)
Output:
Image text recognition refers to the process of automatically identifying and extracting text content from images, similar to OCR (Optical Character Recognition).
image = Image.open('written quote.jpg')
image
enc_image = model.encode_image(image)
output = model.answer_question(enc_image, "what's written on this piece of paper?", tokenizer)
print(color.BOLD+color.BLUE+"Input:"+color.END, "what's written on this piece of paper?")
print(color.BOLD+color.BLUE+"Response:"+color.END, output)
Output:
Having seen the model implementation, let's now look at the model's performance on various standard benchmarks such as VQAv2, GQA, TextVQA, and TallyQA.
Moondream2 is specifically designed to answer questions about images. It has limitations, such as difficulty with abstract queries that go beyond the visible content and limited text-recognition (OCR) accuracy.
This article delves into Moondream2, a compact vision-language model tailored for resource-constrained devices. By dissecting its components and demonstrating its prowess through various image-to-text tasks, Moondream2 proves its utility in real-world applications. However, its limitations, such as difficulty with abstract queries and limited OCR capabilities, underscore the need for continual refinement. Nevertheless, Moondream2 heralds a promising avenue for efficient multi-modal understanding and generation, offering practical solutions across diverse domains.
A. Small language models offer several benefits like faster inference, lower resource requirements, cost-effectiveness, scalability, domain-specific applications, interpretability, and ease of deployment.
A. Moondream2 has two major components: SigLIP and Phi-1.5. SigLIP is a visual encoder, similar to the CLIP model, that can perform zero-shot image classification. Phi-1.5 is part of the Phi series of small language models introduced by Microsoft and has 1.3 billion parameters.
A. Moondream2 has 1.86 billion parameters, and it consumes around 9-10 GB of memory while loading.
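As a rough sanity check, the parameter count and the memory taken by the weights can be estimated directly from the model object loaded earlier in this article (a sketch; the exact footprint depends on the dtype the weights are loaded in and on runtime overhead):

# Count parameters and estimate the memory occupied by the weights
num_params = sum(p.numel() for p in model.parameters())
bytes_per_param = next(model.parameters()).element_size()  # e.g. 4 bytes in float32
print(f"Parameters: {num_params / 1e9:.2f}B")
print(f"Approx. weight memory: {num_params * bytes_per_param / 1e9:.1f} GB")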
A. Due to its compact size, this model can operate across devices with limited resources. For instance, it can be deployed in retail settings to gather data and analyze customer behavior. Similarly, it can be used in drone and robotics applications to survey environments and identify significant activities or objects. Additionally, it serves security purposes by analyzing videos and images to detect and prevent incidents.