Using MaskFormer for Images With Overlapping Objects

Maigari David Last Updated : 22 Nov, 2024
7 min read

Image segmentation is a popular computer vision task served by many different models. Its usefulness across industries and fields has driven steady research and improvement. MaskFormer is part of a new wave of image segmentation models, using a mask attention mechanism to detect objects even when their bounding boxes overlap.

Tasks like this are challenging for image segmentation models that rely solely on per-pixel classification. MaskFormer addresses the problem with its transformer architecture. Other models, such as Mask R-CNN and DETR, offer similar capabilities, but we will examine how MaskFormer departs from traditional image segmentation in its handling of complex, overlapping objects.

Learning Objectives

  • Learning about instance segmentation using MaskFormer.
  • Gaining insight into the working principle of this model.
  • Studying the MaskFormer model architecture.
  • Running inference with the MaskFormer model.
  • Exploring real-life applications of MaskFormer.

This article was published as a part of the Data Science Blogathon.

What is MaskFormer?

Image segmentation with this model has more than one dimension. MaskFormer performs well at both semantic and instance segmentation, and knowing the difference between these two tasks is essential in computer vision.

Semantic segmentation treats each pixel of an image individually, grouping objects into a single class based on their class label; if there is more than one car in an image, the model segments all of them under the one ‘car’ label. Instance segmentation, however, goes beyond assigning one class label per pixel: it separates multiple instances of the same class, so when an image contains more than one car, each can be distinguished, i.e., Car1 and Car2.

The contrast between these tasks highlights what makes the MaskFormer model unique. While other models handle one or the other, MaskFormer handles both instance and semantic segmentation in a unified manner using its mask classification approach.

The mask classification approach predicts a class label and a binary mask for every object instance in the image. Because both semantic and instance segmentation can be expressed this way, a single set of predictions serves both tasks.
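
To make this concrete, here is a minimal, shape-only sketch of what mask classification predicts; all dimensions below are hypothetical placeholders rather than the model’s actual configuration:

import torch

# Hypothetical sizes for illustration: N segment predictions, K classes
# (plus one "no object" class), and an H x W mask resolution.
N, K, H, W = 100, 80, 120, 160

class_logits = torch.randn(N, K + 1)  # a class distribution per predicted segment
mask_logits = torch.randn(N, H, W)    # a full-image mask per predicted segment

# A segment's class is the argmax of its class distribution; its binary mask
# comes from thresholding the sigmoid of its mask logits. Masks may overlap.
pred_classes = class_logits.softmax(-1).argmax(-1)  # shape (N,)
pred_masks = mask_logits.sigmoid() > 0.5            # shape (N, H, W)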

Model Architecture of MaskFormer

The MaskFormer architecture employs different components throughout the image processing pipeline so that it can perform segmentation in both the semantic and instance settings. Like other recent computer vision models, MaskFormer uses a transformer architecture, following an encoder-decoder structure to complete segmentation tasks.

The process starts with a backbone extracting essential image features from the input. The backbone can be any popular convolutional neural network (CNN) architecture; the features it extracts are denoted F.

These features are then passed to a pixel decoder that generates per-pixel embeddings, usually denoted ‘E’, which capture both the global and local context of each pixel in the image. However, MaskFormer does more than per-pixel segmentation when working on images, and that is where per-segment embeddings come in.

In parallel, a transformer decoder also consumes the image features, but this time it generates a set of N per-segment embeddings, denoted Q. These localize the image segments to classify, attending with different weights to various parts of the image; each per-segment embedding corresponds to a potential object instance that MaskFormer looks to identify.
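
To keep the moving parts straight, here is a minimal, shape-only sketch of the three tensors described so far; all dimensions are hypothetical placeholders, not the real model’s configuration:

import torch

B = 1                      # batch size
C_f, Hf, Wf = 256, 32, 32  # backbone feature channels and feature-map size
C_e, H, W = 256, 128, 128  # embedding dimension and output mask resolution
N = 100                    # number of segment queries

F = torch.randn(B, C_f, Hf, Wf)  # image features from the backbone
E = torch.randn(B, C_e, H, W)    # per-pixel embeddings from the pixel decoder
Q = torch.randn(B, N, C_e)       # N per-segment embeddings from the transformer decoder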

This arrangement differs from a traditional transformer architecture, where an encoder consumes the input and a decoder produces the output directly from it. In models like MaskFormer, the backbone plays the encoder’s role: it handles the input image and produces the feature maps that the rest of the pipeline consumes.

This is the foundation of how the model processes images. But how does it produce its output? A few details govern how the class predictions and labels work in this model; let’s dive in.

The per-segment embeddings generated in this process drive the class predictions for the image. Each of the N embeddings is also converted into a mask embedding that represents a potential object instance in the input image.

Next, MaskFormer generates binary masks by performing a dot product between pixel embeddings and mask embeddings, followed by a sigmoid activation. This step produces binary masks for each object instance, allowing some masks to overlap. 
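
Continuing the shape sketch above, this step reduces to a dot product (an einsum) followed by a sigmoid. One simplification: in the actual model, a small MLP first maps the per-segment embeddings Q to mask embeddings; here we use Q directly.

# Dot product between per-pixel embeddings E and (simplified) mask embeddings.
mask_embeddings = Q                                               # (B, N, C_e)
mask_logits = torch.einsum("bnc,bchw->bnhw", mask_embeddings, E)  # (B, N, H, W)
binary_masks = mask_logits.sigmoid() > 0.5                        # masks may overlap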

For semantic segmentation, MaskFormer combines the binary masks and class probabilities through matrix multiplication to create the final segmented, classified image. Semantic segmentation in this model assigns a class label to every pixel in the image.
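
In the same sketch, the semantic map falls out of one more matrix multiplication: weight each segment’s mask probabilities by its class probabilities, sum over segments, and take the best class per pixel (K below is again a placeholder):

K = 80                                            # hypothetical number of classes
class_logits = torch.randn(B, N, K + 1)           # last index = "no object"
class_probs = class_logits.softmax(-1)[..., :-1]  # drop "no object" -> (B, N, K)

semantic_scores = torch.einsum("bnk,bnhw->bkhw", class_probs, mask_logits.sigmoid())
per_pixel_label = semantic_scores.argmax(1)       # (B, H, W): one class label per pixel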

So, it labels every class rather than each instance of those classes. A good illustration: semantic segmentation labels every human in an image as ‘Human,’ while instance segmentation labels each occurrence separately, categorizing them into ‘human1’ and ‘human2.’ This attribute gives MaskFormer an edge in segmentation compared to other models.

DETR is another model that can perform instance segmentation. Although it is not as effective as MaskFormer here, its method is still an improvement over pure per-pixel segmentation: it predicts bounding boxes and class probabilities for the objects in the image instead of segmentation masks.

Here is an example of how segmentation with DETR works: 

[Image: DETR bounding boxes drawn around the detected objects]
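
For comparison, below is a minimal DETR inference sketch using the Hugging Face Transformers API; the checkpoint and the 0.9 confidence threshold are example choices, not requirements:

from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image
import requests

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

url = "https://images.pexels.com/photos/5079180/pexels-photo-5079180.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

# DETR predicts class probabilities and bounding boxes rather than masks.
results = processor.post_process_object_detection(
    outputs, target_sizes=[image.size[::-1]], threshold=0.9
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())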

How To Run the Model

Running this model takes a few simple steps. We will use the Hugging Face Transformers library to get the resources needed to perform instance segmentation on an image.

Importing the Necessary Libraries 

Firstly, you must import tools for processing images and segmenting them into objects, which is where ‘MaskFormerFeatureExtractor’ and ‘MaskFormerForInstanceSegmentation’ come into the picture; the PIL library handles images, while ‘requests’ fetches the image from its URL.

from transformers import MaskFormerFeatureExtractor, MaskFormerForInstanceSegmentation
from PIL import Image
import requests

Loading the Pre-trained MaskFormer Model

The first line of code initializes a feature extractor that prepares an image for the model: it resizes and normalizes the image and converts it into tensors. Then, we load the model, which was trained on the COCO dataset. MaskFormer can perform instance segmentation, and we have just prepared the environment for this task.

feature_extractor = MaskFormerFeatureExtractor.from_pretrained("facebook/maskformer-swin-base-coco")
model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-coco")

Preparing the Image

Since we have the PIL library, we can load and modify images in our environment; here, we load an image from its URL. The code also prepares the image in the format the MaskFormer model needs.

# Load image from URL
url = "https://images.pexels.com/photos/5079180/pexels-photo-5079180.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = feature_extractor(images=image, return_tensors="pt")
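
As a quick sanity check, you can inspect what the feature extractor produced; the exact tensor sizes depend on its resizing settings, so the shape shown below is only indicative:

# The feature extractor returns a dict-like object of tensors.
for name, tensor in inputs.items():
    print(name, tuple(tensor.shape))
# e.g. pixel_values (1, 3, 800, 1066)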

Running the Model on the Preprocessed image

outputs = model(**inputs)
# model predicts class_queries_logits of shape `(batch_size, num_queries, num_labels + 1)`
# and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

This gives us the model’s class predictions for each potential object instance in the image, one set of class logits per query; the number of queries reflects how many potential object instances the model considers. We also get mask logits indicating each instance’s position in the image.
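
A quick way to peek at these raw predictions is to take the most likely class per query; by convention, the last index of the class dimension is the ‘no object’ class:

probs = class_queries_logits.softmax(-1)  # (batch, num_queries, num_classes + 1)
pred_labels = probs.argmax(-1)            # most likely class per query
print("number of queries:", pred_labels.shape[1])
print("mask logits shape:", tuple(masks_queries_logits.shape))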

Results

# you can pass them to feature_extractor for postprocessing
result = feature_extractor.post_process_panoptic_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
# we refer to the demo notebooks for visualization (see "Resources" section in the MaskFormer docs)
predicted_panoptic_map = result["segmentation"]

Finally, we use the feature extractor to post-process the model output into a usable format. The post-processing call returns a list of results, one per image; each result stores the final segmentation map, where every pixel is assigned a label corresponding to an object segment. The full segmentation map thus defines each object’s class through its per-pixel labels.
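
Besides the segmentation map, the post-processed result also carries per-segment metadata; the field names below follow the Transformers panoptic post-processing output:

# Each detected segment has an id, a class label id, and a confidence score.
for segment in result["segments_info"]:
    label = model.config.id2label[segment["label_id"]]
    print(f"segment {segment['id']}: {label} (score {segment['score']:.2f})")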

To display the segmented image, ensure that the torch and matplotlib libraries are available in the environment. They will help you visualize and process the model’s output.

import torch
import matplotlib.pyplot as plt

Here, we convert the output into an image format that we can display.

# Convert to PIL image format and display
plt.imshow(predicted_panoptic_map)
plt.axis('off')
plt.show()
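
Optionally, you can overlay the segmentation map on the original photo to see how the predicted regions line up with the objects; the alpha value here is just a stylistic choice:

plt.imshow(image)
plt.imshow(predicted_panoptic_map, alpha=0.5)  # translucent overlay of segment ids
plt.axis('off')
plt.show()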

Real-life Applications of MaskFormer

Here are some handy applications of this model across various industries:

  • This model can be valuable in the medical industry, where instance segmentation can help with various medical imaging and diagnostic tasks.
  • Instance segmentation has also found application in satellite image interpretation.
  • Video surveillance is another way to leverage instance segmentation models, which can help detect and identify objects in various situations.

There are many ways to use MaskFormer in real life. Facial recognition, autonomous cars, and many other applications can adopt this model’s instance segmentation capabilities.

Conclusion

MaskFormer is useful for handling complex image segmentation tasks, especially images with overlapping objects. This ability distinguishes it from traditional image segmentation models. Its transformer-based architecture makes it versatile enough for both semantic and instance segmentation tasks. MaskFormer improves on traditional per-pixel methods and sets a new standard in segmentation, opening up further potential for advanced computer vision applications.

Key Takeaways

There are many talking points on this topic, but here are a few highlights from exploring this model:

  • MaskFormer’s Unique Approach: This model combines a mask attention mechanism with a transformer-based framework to segment individual object instances, even when they overlap.
  • Versatility in Application: The model serves various purposes across industries, including autonomous driving, medical diagnostics, and satellite image interpretation.
  • Segmentation Capabilities: Few traditional models can handle dual segmentation like MaskFormer, which performs both semantic and instance segmentation.

Frequently Asked Questions

Q1. What makes MaskFormer different from other traditional segmentation models?

A. This model uses a mask attention mechanism within a transformer framework, allowing it to handle overlapping objects in images better than models using per-pixel methods.

Q2. Can MaskFormer perform both semantic and instance segmentation?

A. Yes. MaskFormer can perform both semantic segmentation (labeling every pixel by class, without separating instances) and instance segmentation (distinguishing individual instances within a class).

Q3. What industries benefit from using MaskFormer?

A. MaskFormer is widely applicable in industries like healthcare (for medical imaging and diagnostics), geospatial analysis (for satellite images), and security (for surveillance systems).

Q4. How does MaskFormer produce the final segmented image?

A. It combines binary masks with class labels through matrix multiplication, creating a final segmented and classified image that accurately highlights each object instance.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Hey there! I'm David Maigari, a dynamic professional with a passion for technical writing, web development, and the AI world, and an enthusiast of data science and AI innovations.
