The introduction of the original transformer paved the way for today's Large Language Models. Soon after, the Vision Transformer (ViT) was introduced. Just as transformers excel at understanding and generating text from a prompt, vision transformer models were developed to understand images and extract information from them. Together, these led to Vision Language Models, which combine both capabilities. Microsoft has taken this a step further and introduced a model capable of performing many vision tasks with a single model. In this guide, we will take a look at this model, called Florence-2, released by Microsoft and designed to solve many different vision tasks.
Florence-2 is a Vision Language Model (VLM) developed by the Microsoft team. It comes in two sizes: a 0.23B-parameter version and a 0.77B-parameter version. These small sizes make it easy for everyone to run the models even on a CPU. Florence-2 was created with the idea that a single model can solve many tasks: it is trained for object detection, object segmentation, image captioning (including detailed captions), phrase grounding, OCR (Optical Character Recognition), and combinations of these.
The Florence-2 Vision Language Model is trained on FLD-5B, a dataset created by the Microsoft team. FLD-5B contains about 5.4 billion annotations on around 126 million images, including 500 million text annotations, 1.3 billion region-text annotations, and 3.6 billion text-phrase-region annotations. Florence-2 accepts a text instruction and an image as input and generates text results for tasks like OCR, object detection, or image captioning.
The architecture consists of a visual encoder followed by a transformer encoder-decoder block, and training uses the standard cross-entropy loss. For regions, the Florence-2 model works with three types of representations: box representations for object detection, quad-box representations for OCR text detection, and polygon representations for segmentation tasks.
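To make these three representations concrete, here is a purely illustrative sketch of each coordinate format. The numbers are made up and only show the shape of the data that the later sections work with.

# Illustrative only: the shapes of the three region formats (values are made up).
box = [120.0, 80.0, 340.0, 260.0]                # object detection: [x1, y1, x2, y2]
quad_box = [120.0, 80.0, 340.0, 80.0,
            340.0, 120.0, 120.0, 120.0]          # OCR text region: four (x, y) corners, flattened
polygon = [120.0, 80.0, 200.0, 60.0, 340.0, 120.0,
           260.0, 260.0, 140.0, 220.0]           # segmentation: [x1, y1, x2, y2, ..., xn, yn]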
Image captioning is a vision-language task where, given an image, the deep learning model outputs a caption for it. This caption can be short or detailed depending on the training the model has undergone. Models that perform this task are trained on large image-captioning datasets, where they learn to output text given an image. The more data they are trained on, the better they get at describing images.
We will start by downloading and installing the libraries that we need to run the Florence-2 model.
!pip install -q -U transformers accelerate flash_attn einops timm
Now, we need to download the Florence-2 model. For this, we will work with the below code.
from transformers import AutoProcessor, AutoModelForCausalLM

# Florence-2 ships its modeling code with the checkpoint, so trust_remote_code is required
model_id = 'microsoft/Florence-2-large-ft'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

# The processor holds no model weights, so it needs no device placement
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
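The code above loads the model on a GPU. Since the checkpoints are small, CPU-only loading should also be possible; the snippet below is only a sketch under that assumption (depending on the version, the Florence-2 remote code may still expect flash_attn to be installed), and if you use it, the .to("cuda") calls on the inputs in the later sections must be dropped as well.

# CPU-only loading (a sketch; the rest of this guide assumes a GPU).
import torch

model_cpu = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float32
).eval()
processor_cpu = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)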
AutoProcessor is very similar to AutoTokenizer, but while the AutoTokenizer class deals only with text and text tokenization, AutoProcessor handles both text tokenization and image preprocessing. Because Florence-2 works with image data, we use the AutoProcessor.
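To see what the processor actually produces, you can run it on a task prompt and a throwaway image. The sketch below is just for inspection; the rest of this guide only relies on the input_ids and pixel_values it returns.

# Quick sketch: inspect what the processor returns for a prompt plus an image.
from PIL import Image

dummy_image = Image.new("RGB", (512, 512), color="white")
sample_inputs = processor(text="<CAPTION>", images=dummy_image, return_tensors="pt")
print(sample_inputs.keys())                 # text token ids alongside image tensors
print(sample_inputs["pixel_values"].shape)  # the preprocessed image batch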
Now, let us take an image:
from PIL import Image
image = Image.open("/content/beach.jpg")
Here, we have taken a beach photo.
Now we will give this image to the Florence-2 Vision Language Model and ask it to generate a caption.
PROMPT = "<CAPTION>"
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
input_ids=inputs["input_ids"],
pixel_values=inputs["pixel_values"],
max_new_tokens=512,
do_sample=False,
)
text_generations = processor.batch_decode(generated_ids,
skip_special_tokens=False)[0]
result = processor.post_process_generation(text_generations,
task=PROMPT, image_size=(image.width, image.height))
print(result[PROMPT])
Running the code, we see that the model has generated the caption “An umbrella and lounge chair on a beach with the ocean in the background” for the image. This caption is very short.
We can take this a step further by providing other prompts like <DETAILED_CAPTION> and <MORE_DETAILED_CAPTION>.
The code for trying this can be seen below:
PROMPT = "<DETAILED_CAPTION>"
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
input_ids=inputs["input_ids"],
pixel_values=inputs["pixel_values"],
max_new_tokens=512,
do_sample=False,
)
text_generations = processor.batch_decode(generated_ids,
skip_special_tokens=False)[0]
result = processor.post_process_generation(text_generations,
task=PROMPT, image_size=(image.width, image.height))
print(result[PROMPT])
PROMPT = "<MORE_DETAILED_CAPTION>"
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
input_ids=inputs["input_ids"],
pixel_values=inputs["pixel_values"],
max_new_tokens=512,
do_sample=False,
)
text_generations = processor.batch_decode(generated_ids,
skip_special_tokens=False)[0]
result = processor.post_process_generation(text_generations,
task=PROMPT, image_size=(image.width, image.height))
print(result[PROMPT])
Here, we have used <DETAILED_CAPTION> and <MORE_DETAILED_CAPTION> as the task types. The <DETAILED_CAPTION> prompt produced the output “In this image we can see a chair, table, umbrella, water, ships, trees, building and sky with clouds.” and the <MORE_DETAILED_CAPTION> prompt produced the output “An orange umbrella is on the beach. There is a white lounge chair next to the umbrella. There are two boats in the water.” So with these two prompts, we get more depth in the image captioning than with the regular <CAPTION> prompt.
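Since each of the tasks below repeats the same generate, decode, and post-process steps, you could optionally wrap them in a small helper like the sketch below. The function name run_florence is just an illustration for this guide and not part of the Florence-2 or transformers API; the remaining sections keep the explicit code for clarity.

# Optional helper (illustrative only): wraps the repeated generate -> decode -> post-process pattern.
def run_florence(task, image, text=""):
    prompt = task + text
    inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
        do_sample=False,
    )
    text_generations = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(text_generations,
        task=task, image_size=(image.width, image.height))

# Example: run_florence("<CAPTION>", image)["<CAPTION>"]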
Object detection is one of the well-known tasks in computer vision. It deals with finding objects in a given image. In object detection, the model identifies objects in the image and provides the X and Y coordinates of bounding boxes around them. The Florence-2 Vision Language Model is very much capable of detecting objects in an image.
Let us try this with the below image:
image = Image.open("/content/van.jpg")
Here, we have an image of a bright orange van on the road with a white building in the background.
Now let us give this image to the Florence-2 Vision Language Model.
PROMPT = "<OD>"
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
input_ids=inputs["input_ids"],
pixel_values=inputs["pixel_values"],
max_new_tokens=512,
do_sample=False,
)
text_generations = processor.batch_decode(generated_ids,
skip_special_tokens=False)[0]
results = processor.post_process_generation(text_generations,
task=PROMPT, image_size=(image.width, image.height))
The process for object detection is very similar to the image captioning task that we have just done. The only difference is that we change the prompt to <OD>, meaning object detection. We give this prompt along with the image to the processor object and obtain the tokenized inputs. Then we give these tokenized inputs along with the image pixel values to the Florence-2 Vision Language Model to generate the output, and finally decode that output.
The output is stored in a variable named results. The results variable has the format {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['label1', 'label2', ...]}}. So the Florence-2 Vision Model outputs the bounding box coordinates (x1, y1, x2, y2) for each label, that is, for each object that it detects in the image.
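Before plotting, you can quickly sanity-check this structure by printing the labels with their boxes (a small optional sketch based on the format described above):

# Quick sanity check: print each detected label with its bounding box coordinates.
for bbox, label in zip(results[PROMPT]['bboxes'], results[PROMPT]['labels']):
    x1, y1, x2, y2 = bbox
    print(f"{label}: ({x1:.1f}, {y1:.1f}) -> ({x2:.1f}, {y2:.1f})")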
Now, we will draw those bounding boxes on the image with the coordinates that we have.
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots()
ax.imshow(image)

# Draw one rectangle and label per detected object
for bbox, label in zip(results[PROMPT]['bboxes'], results[PROMPT]['labels']):
    x1, y1, x2, y2 = bbox
    rect_box = patches.Rectangle((x1, y1), x2-x1, y2-y1, linewidth=1,
                                 edgecolor='r', facecolor='none')
    ax.add_patch(rect_box)
    plt.text(x1, y1, label, color='white', fontsize=8, bbox=dict(facecolor='red', alpha=0.5))

ax.axis('off')
plt.show()
Running this code, we see that the Florence-2 Vision Language Model has generated many bounding boxes for the van image that we gave it. The model has detected the van, windows, and wheels and was able to give the correct coordinates for each label.
Next, we have a task called “Caption to Phrase Grounding”, which the Florence-2 model supports. Given an image and a caption for it, the task of phrase grounding is to link each noun phrase in the caption to the most relevant entity or object region in the image.
We can take a look at this task with the below code:
PROMPT = "<CAPTION_TO_PHRASE_GROUNDING> An orange van parked in front of a white building"
task_type = "<CAPTION_TO_PHRASE_GROUNDING>"
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
input_ids=inputs["input_ids"],
pixel_values=inputs["pixel_values"],
max_new_tokens=512,
do_sample=False,
)
text_generations = processor.batch_decode(generated_ids,
skip_special_tokens=False)[0]
results = processor.post_process_generation(text_generations,
task=task_type, image_size=(image.width, image.height))
Here, for the prompt, we give “<CAPTION_TO_PHRASE_GROUNDING> An orange van parked in front of a white building”, where the task is “<CAPTION_TO_PHRASE_GROUNDING>” and the phrase is “An orange van parked in front of a white building”. The Florence-2 model tries to generate bounding boxes for the objects/entities it can extract from this phrase. Let us see the final output by plotting it.
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots()
ax.imshow(image)

# Draw one rectangle and label per grounded phrase
for bbox, label in zip(results[task_type]['bboxes'], results[task_type]['labels']):
    x1, y1, x2, y2 = bbox
    rect_box = patches.Rectangle((x1, y1), x2-x1, y2-y1, linewidth=1,
                                 edgecolor='r', facecolor='none')
    ax.add_patch(rect_box)
    plt.text(x1, y1, label, color='white', fontsize=8, bbox=dict(facecolor='red', alpha=0.5))

ax.axis('off')
plt.show()
Here we see that the Florence-2 Vision Language Model was able to extract two entities: an orange van and a white building. Florence-2 then generated bounding boxes for each of these entities. This way, given a caption, the model can extract the relevant entities/objects from it and generate the corresponding bounding boxes.
Segmentation is a process where an image is taken and masks are generated for parts of the image, with each mask corresponding to an object. Segmentation is the next stage of object detection: in object detection, we only find the location of the object and generate bounding boxes, whereas in segmentation, instead of a rectangular bounding box, we generate a mask in the shape of the object. This is helpful because we know not only the location of the object but also its shape. And Florence-2 supports segmentation as well.
We will try segmentation on our van image.
PROMPT = "<REFERRING_EXPRESSION_SEGMENTATION>two black tires"
task_type = "<REFERRING_EXPRESSION_SEGMENTATION>"
inputs = processor(text=PROMPT, images=image, return_tensors="pt").to("cuda")
generated_ids = model.generate(
input_ids=inputs["input_ids"],
pixel_values=inputs["pixel_values"],
max_new_tokens=512,
do_sample=False,
)
text_generations = processor.batch_decode(generated_ids,
skip_special_tokens=False)[0]
results = processor.post_process_generation(text_generations,
task=task_type, image_size=(image.width, image.height))
Here, the results variable has the format {'<REFERRING_EXPRESSION_SEGMENTATION>': {'polygons': [[[polygon]], ...], 'labels': ['', '', ...]}}, where each object/mask is represented by a list of polygons, and each polygon is a flat list of the form [x1, y1, x2, y2, ..., xn, yn].
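If you ever need a raw binary mask instead of a drawing (for example, for downstream processing), the polygons can be rasterized with PIL. The sketch below is an optional extra, not part of the walkthrough that follows:

# Optional sketch: rasterize the returned polygons into a binary NumPy mask.
import numpy as np
from PIL import Image, ImageDraw

mask = Image.new("L", (image.width, image.height), 0)
mask_draw = ImageDraw.Draw(mask)

for polygons in results[task_type]['polygons']:
    for _polygon in polygons:
        points = np.array(_polygon).reshape(-1, 2)
        if len(points) < 3:
            continue  # skip degenerate polygons
        mask_draw.polygon([tuple(p) for p in points], fill=255)

binary_mask = np.array(mask) > 0  # True where the model predicted the referred object
print("Mask covers", int(binary_mask.sum()), "pixels")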
Now, we will create these masks and overlay them on the actual image so we can visualize it better.
import copy
import numpy as np
from IPython.display import display
from PIL import Image, ImageDraw, ImageFont

# Work on a copy so the original image stays untouched
output_image = copy.deepcopy(image)
res = results[task_type]
draw = ImageDraw.Draw(output_image)
scale = 1

for polygons, label in zip(res['polygons'], res['labels']):
    fill_color = "blue"
    for _polygon in polygons:
        _polygon = np.array(_polygon).reshape(-1, 2)
        if len(_polygon) < 3:
            print('Invalid polygon:', _polygon)
            continue
        _polygon = (_polygon * scale).reshape(-1).tolist()
        # Fill the predicted region and write its label near the first vertex
        draw.polygon(_polygon, outline="indigo", fill=fill_color)
        draw.text((_polygon[0] + 8, _polygon[1] + 2), label, fill="indigo")

display(output_image)
The Florence-2 Vision Language Model successfully understood our query of “two black tires” and inferred that the image contained a vehicle with visible black tires. The model generated polygon representations for these tires, which were masked with a blue color. The model excelled in diverse computer vision tasks due to the strong training data curated by the Microsoft Team.
Florence-2 is a Vision Language Model created and trained from the ground up by the Microsoft team. Unlike many other Vision Language Models, Florence-2 performs a variety of computer vision tasks, including object detection, image captioning, phrase grounding, OCR, segmentation, and combinations of these. In this guide, we have looked at how to download the Florence-2 large model and how to perform different computer vision tasks by changing the prompt given to Florence-2.
Q. What is Florence-2?
A. Florence-2 is a Vision Language Model developed by the Microsoft team, released in two sizes: a 0.23B-parameter and a 0.77B-parameter version.
Q. How does AutoProcessor differ from AutoTokenizer?
A. AutoTokenizer can only deal with text data, converting text to tokens. AutoProcessor, on the other hand, pre-processes data for multi-modal models, which includes image data as well.
Q. What is the FLD-5B dataset?
A. FLD-5B is an image dataset curated by the Microsoft team. It contains about 5.4 billion text annotations for around 126 million images.
Q. What does the Florence-2 model output?
A. The Florence-2 model outputs text based on the given input image and input text. This text can be a simple image caption, or it can be bounding box coordinates if the task is object detection or segmentation.
Q. Is Florence-2 open source?
A. Yes. Florence-2 is released under the MIT License, making it open source, and one does not need to authenticate with HuggingFace to work with this model.