2023 has been a year of AI breakthroughs, from large language models to stable diffusion. One of the new players taking center stage is KOSMOS-2, developed by Microsoft: a multimodal large language model (MLLM) making waves with groundbreaking capabilities in understanding text and images together. Building a language model is one challenge and building a vision model is another, but combining both in a single model is a whole new level of artificial intelligence. In this article, we will delve into the features and potential applications of KOSMOS-2 and its impact on AI and machine learning.
KOSMOS-2 is the brainchild of a team of researchers at Microsoft, introduced in the paper “Kosmos-2: Grounding Multimodal Large Language Models to the World.” Designed to handle text and images simultaneously and redefine how we interact with multimodal data, KOSMOS-2 is built on a Transformer-based causal language model architecture, similar to other renowned models like LLaMA-2 and Mistral AI’s 7B model.
However, what sets KOSMOS-2 apart is its training process. It is trained on a vast dataset of grounded image-text pairs known as GRIT, in which references to objects in the text are linked to regions of the image through bounding boxes encoded as special location tokens. This approach gives KOSMOS-2 a new, grounded understanding of text and images.
One of the standout features of KOSMOS-2 is its ability to perform “multimodal grounding.” This means that it can generate captions for images that describe the objects and their location within the image. This reduces “hallucinations,” a common issue in language models, dramatically improving the model’s accuracy and reliability.
This grounding works by connecting phrases in the text to objects in the image through special location tokens, effectively anchoring the language in the visual context and helping the model generate accurate image captions.
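To make this concrete, here is an illustrative sketch of what a grounded caption looks like: each object phrase is wrapped in <phrase> tags and followed by a pair of <patch_index_*> location tokens inside <object> tags, encoding the top-left and bottom-right corners of its bounding box. The phrase and index values below are made up purely for illustration and are not real model output.
# Illustrative only: this string is hand-written, not actual KOSMOS-2 output.
grounded_caption = (
    "<grounding>An image of<phrase> a dog</phrase>"
    "<object><patch_index_0123><patch_index_0456></object> on a couch"
)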
KOSMOS-2 also excels in “referring expression generation.” This feature lets users prompt the model with a specific bounding box in an image and a question. The model can then answer questions about specific locations in the image, providing a powerful tool for understanding and interpreting visual content.
Referring expression generation lets users interact with specific image regions through prompts and opens new avenues for natural language interaction with visual content.
We will now see how to run inference with the KOSMOS-2 model on Colab. Find the entire code here: https://github.com/inuwamobarak/KOSMOS-2
In this step, we install necessary dependencies like 🤗 Transformers, Accelerate, and Bitsandbytes. These libraries are crucial for efficient inference with KOSMOS-2.
!pip install -q git+https://github.com/huggingface/transformers.git accelerate bitsandbytes
Next, we load the KOSMOS-2 model and its processor.
from transformers import AutoProcessor, AutoModelForVision2Seq
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224", load_in_4bit=True, device_map={"": 0})
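If your runtime has no GPU, or bitsandbytes is unavailable, a minimal alternative (assuming enough RAM) is to load the model without quantization and keep it on the CPU; in that case, also drop the .to("cuda:0") calls on the processor inputs in the later steps.
# Assumption: CPU-only fallback without 4-bit quantization (slower, but no GPU needed).
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
model = model.to("cpu")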
In this step, we perform image grounding. We load an image and provide a prompt for the model to complete, starting with the special <grounding> token, which instructs the model to ground the phrases it generates to objects in the image.
import requests
from PIL import Image
prompt = "<grounding>An image of"
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)
image
Next, we prepare the image and prompt for the model using the processor. We then let the model autoregressively generate a completion. The generated completion provides information about the image and its content.
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
# Autoregressively generate completion
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Convert generated token IDs back to strings
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
Let us first look at the raw generated text, which still includes tokens related to the image embedding; the post-processing step below turns this into meaningful results.
print(generated_text)
<image>. the, to and of as in I that' for is was- on’ it with The as at bet he have from by are " you his “ this said not has an ( but had we her they will my or were their): up about out who one all been she can more would It</image><grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming up by<phrase> a fire</phrase><object><patch_index_0006><patch_index_0879></object>
This step focuses on the generated text beyond the initial image-related tokens. We extract details, including object names, phrases, and location tokens. This extracted information is more meaningful and allows us to better understand the model’s response.
# By default, the generated text is cleaned up and the entities are extracted.
processed_text, entities = processor.post_process_generation(generated_text)
print(processed_text)
print(entities)
An image of a snowman warming up by a fire
[('a snowman', (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a fire', (36, 42), [(0.203125, 0.015625, 0.484375, 0.859375)])]
end_of_image_token = processor.eoi_token
caption = generated_text.split(end_of_image_token)[-1]
print(caption)
<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming up by<phrase> a fire</phrase><object><patch_index_0006><patch_index_0879></object>
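As an alternative, if you want the cleaned-up text with the <phrase> and <object> markup kept intact rather than extracted into a list of entities, the processor’s post_process_generation method also accepts a cleanup_and_extract flag (as shown in the Hugging Face model card):
# Keep the grounding markup instead of extracting entities.
raw_processed_text = processor.post_process_generation(generated_text, cleanup_and_extract=False)
print(raw_processed_text)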
We show how to visualize the bounding boxes of objects identified in the image. This step allows us to understand where the model has located specific objects. We leverage the extracted information to annotate the image.
from PIL import ImageDraw
width, height = image.size
draw = ImageDraw.Draw(image)
for entity, _, box in entities:
    # Each entity comes with normalized box coordinates; take the first box.
    box = [round(i, 2) for i in box[0]]
    x1, y1, x2, y2 = tuple(box)
    # Scale the normalized [0, 1] coordinates to pixel coordinates.
    x1, x2 = x1 * width, x2 * width
    y1, y2 = y1 * height, y2 * height
    draw.rectangle(xy=((x1, y1), (x2, y2)), outline="red")
    draw.text(xy=(x1, y1), text=entity)
image
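As a small sketch, the drawing logic above can be wrapped in a reusable helper that draws every box returned for every entity, not just the first one. The function name draw_entity_boxes is ours, not part of the library.
from PIL import Image, ImageDraw

def draw_entity_boxes(img: Image.Image, entities) -> Image.Image:
    # Entities are (name, char_span, list_of_normalized_boxes) tuples.
    annotated = img.copy()
    draw = ImageDraw.Draw(annotated)
    width, height = annotated.size
    for name, _, boxes in entities:
        for x1, y1, x2, y2 in boxes:
            # Scale normalized [0, 1] coordinates to pixels before drawing.
            pixel_box = (x1 * width, y1 * height, x2 * width, y2 * height)
            draw.rectangle(pixel_box, outline="red")
            draw.text((pixel_box[0], pixel_box[1]), text=name)
    return annotated

draw_entity_boxes(image, entities)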
KOSMOS-2 allows you to interact with specific objects in an image. In this step, we prompt the model with a bounding box and a question related to a particular object. The model provides answers based on the context and information from the image.
url = "https://huggingface.co/ydshieh/kosmos-2-patch14-224/resolve/main/pikachu.png"
image = Image.open(requests.get(url, stream=True).raw)
image
We can prepare a question and a bounding box for Pikachu. The special <phrase> tokens mark the phrase in the question that the bounding box refers to. This step showcases how to get specific information from an image with grounded question answering.
prompt = "<grounding> Question: What is<phrase> this character</phrase>? Answer:"
inputs = processor(text=prompt, images=image, bboxes=[(0.04182509505703422, 0.39244186046511625, 0.38783269961977185, 1.0)], return_tensors="pt").to("cuda:0")
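The bboxes argument expects normalized (x1, y1, x2, y2) coordinates in the [0, 1] range, relative to the image width and height. If you start from pixel coordinates instead (the pixel values below are hypothetical placeholders, not measured from this image), a minimal sketch of the conversion looks like this:
# Assumption: a hypothetical pixel-space box (left, top, right, bottom) around the character.
left, top, right, bottom = 10, 140, 100, 340
width, height = image.size
normalized_bbox = (left / width, top / height, right / width, bottom / height)
inputs = processor(text=prompt, images=image, bboxes=[normalized_bbox], return_tensors="pt").to("cuda:0")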
We allow the model to autoregressively complete the question, generating an answer based on the provided context.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
# By default, the generated text is cleaned up, and the entities are extracted.
processed_text, entities = processor.post_process_generation(generated_text)
print(processed_text)
print(entities)
Question: What is this character? Answer: Pikachu in the anime.
[('this character', (18, 32), [(0.046875, 0.390625, 0.390625, 0.984375)])]
KOSMOS-2’s capabilities extend far beyond the lab and into real-world applications. Its grounded understanding of text and images can make an impact in areas such as robotics (helping robots perceive and reason about their surroundings), document intelligence, multimodal dialogue systems, and image captioning with object-level detail.
KOSMOS-2 represents a leap forward in the field of multimodal AI. Its ability to precisely understand and describe both text and images opens up new possibilities for how we interact with visual data. As AI research advances, models like KOSMOS-2 bring us closer to advanced machine intelligence and are set to reshape industries.
Some see models like this as a step toward artificial general intelligence (AGI), which is currently only a hypothetical type of intelligent agent. If realized, an AGI could learn to perform the range of tasks that humans can perform.
Microsoft’s KOSMOS-2 is a testament to the potential of combining text and images to create new capabilities and applications. As it finds its way into more domains, we can expect AI-driven innovations that were once considered beyond the reach of technology. By bridging the gap between text and images, models like KOSMOS-2 could reshape industries and open doors to innovative applications. As we continue to explore the possibilities of multimodal language models, we can expect exciting advancements that pave the way toward more advanced machine intelligence.
Q1: What is KOSMOS-2, and what sets it apart?
A1: KOSMOS-2 is a multimodal large language model developed by Microsoft. What sets it apart is its ability to understand both text and images simultaneously, thanks to a training process in which objects referenced in the text are grounded to bounding boxes in the image.
Q2: How does KOSMOS-2 improve accuracy?
A2: KOSMOS-2 enhances accuracy by performing multimodal grounding, generating image captions that include object locations. This reduces hallucinations and provides a more reliable understanding of visual content.
Q3: What is multimodal grounding?
A3: Multimodal grounding is the ability of KOSMOS-2 to connect text to objects in images using unique location tokens. This is crucial for reducing ambiguity in language models and improving their performance on visual content tasks.
Q4: Where can KOSMOS-2 be applied?
A4: KOSMOS-2 can be integrated into robotics, document intelligence, multimodal dialogue systems, and image captioning. It enables robots to understand their environment, helps process complex documents, and supports natural language interactions with visual content.
Q5: How does KOSMOS-2 represent object locations?
A5: KOSMOS-2 uses unique location tokens that encode bounding boxes directly in the text. These tokens guide the model in generating accurate captions that include object positions.