Image captioning is the task of generating a text or written description for an image that conveys what the image contains; in other words, translating an image into a textual description by connecting vision (the image) and language (the text). In this article, we achieve this using Vision Transformers (ViT) as the core technology, with a PyTorch backend. The goal is to show a way of employing transformers, ViTs in particular, to generate image captions with pretrained models rather than retraining from scratch.
With the current prevalence of social media platforms and the online use of pictures, image captioning has many applications, including describing images, citing them, aiding the visually impaired, and even improving search engine optimization. This makes the technique very handy for projects that involve images.
You can find the entire code used in this GitHub repo.
This article was published as a part of the Data Science Blogathon.
Before we look into ViT, let’s start with an understanding of transformers. Since their introduction in 2017 by Google Brain, transformers have attracted strong interest for their capabilities in NLP. A transformer is a deep learning model distinguished by its use of self-attention, which differentially weights the significance of each part of the input data, and it has been used primarily in the field of natural language processing (NLP).
Transformers are designed for sequential input data such as natural language, but unlike recurrent models, they process the entire input all at once. With the help of the attention mechanism, every position in the input sequence has context from the whole sequence. This allows for much more parallelization, which reduces training time.
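To make the idea of self-attention concrete, below is a minimal, illustrative sketch of scaled dot-product attention in PyTorch. The tensor sizes and the reuse of the input as queries, keys, and values are simplifications for the example and are not part of the captioning model we build later.
# A minimal sketch of scaled dot-product self-attention (illustrative sizes)
import torch
import torch.nn.functional as F

# Toy input: 1 sequence of 4 tokens, each an 8-dimensional embedding
x = torch.randn(1, 4, 8)

# In a real transformer, Q, K and V come from learned linear projections of x;
# we reuse x directly here to keep the sketch minimal
Q, K, V = x, x, x

# Every token attends to every other token in a single step
scores = Q @ K.transpose(-2, -1) / (K.size(-1) ** 0.5)  # (1, 4, 4)
weights = F.softmax(scores, dim=-1)                      # attention weights
output = weights @ V                                     # (1, 4, 8) context-aware tokens
print(weights.shape, output.shape)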
Now let us look at the architectural makeup of transformers. The Transformer architecture is primarily an encoder-decoder structure, presented in the famous paper titled “Attention Is All You Need”.
The encoder is made up of layers that process the input iteratively, one layer after another, while the decoder layers receive the encoder output and generate a decoded output. Simply put, the encoder maps the input sequence to an intermediate representation, which is then fed into the decoder; the decoder then generates an output sequence.
Since this article shows a practical use of ViTs in image captioning, it is useful to also understand how ViTs work. Vision Transformers are transformers adapted to visual tasks. They split an image into patches and use attention mechanisms to find the relationships between those patches; in this use case, they connect our image with text tokens.
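As a rough illustration of how a ViT turns an image into a sequence of tokens, the sketch below splits a dummy image into 16×16 patches and flattens each one, which is essentially what the patch-embedding step does before a learned linear projection and position embeddings feed the patches to the transformer encoder. The image and patch sizes here are assumptions chosen to match a common ViT configuration.
# Illustrative sketch: turning an image into a sequence of patch "tokens"
import torch

image = torch.randn(1, 3, 224, 224)  # dummy RGB image, 224x224 (a common ViT input size)
patch_size = 16

# Split into non-overlapping 16x16 patches along height and width
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# Rearrange into a sequence of flattened patches: (batch, num_patches, patch_pixels)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

print(patches.shape)  # torch.Size([1, 196, 768]) -> 196 patch tokens of length 768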
With the understanding of what transformers are and how they work, let us go on to implement our image captioning model. We will start by installing the transformer library and then build the model before using our model to generate captions of images.
Before we go on to write the code, keep in mind that we are using the vit-gpt2-image-captioning model, which has been trained for image captioning and is made available on the Hugging Face Hub. The backbone of this model is a vision transformer.
The first thing is to install the Transformer library since it is not pre-installed yet in Colab.
# Installing Transformer Libraries
!pip install transformers
Now, we can import libraries.
# Web links Handler
import requests
# Backend
import torch
# Image Processing
from PIL import Image
# Transformer and pre-trained Model
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2TokenizerFast
# Progress bar for loops over multiple images
from tqdm import tqdm
# Assign available GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
# Loading a fine-tuned image captioning Transformer Model
# ViT Encoder-Decoder Model
model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning").to(device)
# Corresponding ViT Tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
# Image processor
image_processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
We have loaded three (3) pretrained components from the transformers library. Briefly, VisionEncoderDecoderModel is the ViT-encoder/GPT-2-decoder model that maps an image to text; GPT2TokenizerFast converts between text and the token IDs the decoder works with; and ViTImageProcessor resizes and normalizes images into the tensors the ViT encoder expects.
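If you want to confirm what has been loaded, a quick optional check like the one below (using the objects created above) shows that the VisionEncoderDecoderModel wraps a ViT encoder and a GPT-2 decoder; the names in the comments are what we expect for this checkpoint.
# Optional sanity check on the loaded components
print(type(model.encoder).__name__)   # expected: ViTModel
print(type(model.decoder).__name__)   # expected: GPT2LMHeadModel
print(image_processor.size)           # input resolution expected by the ViT encoder
print(type(tokenizer).__name__)       # expected: GPT2TokenizerFast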
Now we need to create functions for handling URLs and loading the images we wish to caption.
# Accessing images from the web
import urllib.parse as parse
import os

# Verify that a string is a valid URL
def check_url(string):
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except:
        return False

# Load an image from a URL or a local path
def load_image(image_path):
    if check_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)
So we just created two functions: one to verify a URL, and another that uses the verified URL (or a local path) to load the image for captioning.
Inference is the step where the model produces a caption for an image based on its visual features. One approach is to convert the image to tensors using PyTorch (as done here) rather than working with raw pixels directly. To perform our inference, we use the general method shown below to autoregressively generate the caption.
# Image inference
def get_caption(model, image_processor, tokenizer, image_path):
    image = load_image(image_path)
    # Preprocessing the image
    img = image_processor(image, return_tensors="pt").to(device)
    # Generating captions
    output = model.generate(**img)
    # Decode the output
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return caption
We have used greedy decoding, which is the default. Other options include beam search and multinomial sampling; you can experiment with them and see the difference.
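As a sketch of what such experiments might look like, the snippet below passes standard generate() arguments (num_beams for beam search, do_sample/top_k/top_p for sampling); the specific values are only example settings, and the URL simply reuses one of the images from the next section.
# Illustrative decoding alternatives (values are example settings, not tuned)
url = "https://images.pexels.com/photos/101667/pexels-photo-101667.jpeg?auto=compress&cs=tinysrgb&w=600"
img = image_processor(load_image(url), return_tensors="pt").to(device)

# Beam search with 5 beams
beam_output = model.generate(**img, num_beams=5, max_length=32)
print(tokenizer.batch_decode(beam_output, skip_special_tokens=True)[0])

# Multinomial (top-k / nucleus) sampling
sample_output = model.generate(**img, do_sample=True, top_k=50, top_p=0.95, max_length=32)
print(tokenizer.batch_decode(sample_output, skip_special_tokens=True)[0])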
Finally, we can load and caption our images as we require. We will load a number of images and see how the captioning performs. Note that these images are not from the COCO dataset but from sources across the web. Feel free to use your own images.
# Image media display
from IPython.display import display
# Loading URLs
url = "https://images.pexels.com/photos/101667/pexels-photo-101667.jpeg?auto=compress&cs=tinysrgb&w=600"
# Display Image
display(load_image(url))
# Display Caption
get_caption(model, image_processor, tokenizer, url)
Caption:
a black horse running through a grassy field
# Loading URLs
url = "https://images.pexels.com/photos/103123/pexels-photo-103123.jpeg?auto=compress&cs=tinysrgb&w=600"
# Display Image
display(load_image(url))
# Display Caption
get_caption(model, image_processor, tokenizer, url)
Caption:
a man standing on top of a hill with a mountain
# Loading URLs
url = "https://images.pexels.com/photos/406014/pexels-photo-406014.jpeg?auto=compress&cs=tinysrgb&w=600"
# Display Image
display(load_image(url))
# Display Caption
get_caption(model, image_processor, tokenizer, url)
Caption:
a dog with a long nose
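Since we imported tqdm earlier, here is an optional sketch for captioning several images in one loop with a progress bar; the list simply reuses the example URLs above and can be replaced with your own paths or links.
# Caption a list of images with a progress bar
urls = [
    "https://images.pexels.com/photos/101667/pexels-photo-101667.jpeg?auto=compress&cs=tinysrgb&w=600",
    "https://images.pexels.com/photos/103123/pexels-photo-103123.jpeg?auto=compress&cs=tinysrgb&w=600",
    "https://images.pexels.com/photos/406014/pexels-photo-406014.jpeg?auto=compress&cs=tinysrgb&w=600",
]
for u in tqdm(urls):
    print(get_caption(model, image_processor, tokenizer, u))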
Before we wrap up, note that image captioning is only one use case of Vision Transformers; they are also widely applied to other tasks such as image recognition, generative modeling, and multi-modal tasks.
We have carried out image captioning using Vision Transformer (ViT) technology with a PyTorch backend. ViTs are transformer-based deep learning models that treat an image as a sequence of patches and process it with self-attention, which allows parallelization and shorter training times. Using the pretrained VisionEncoderDecoderModel, GPT2TokenizerFast, and ViTImageProcessor gave us an easy way to build the pipeline without training from scratch. Such models can also outperform purely supervised pre-training and are well suited to image captioning.
Q. What are vision transformers used for?
A. Vision transformers are widely applied in image recognition, generative modeling, and multi-modal tasks.
Q. What are the main components of vision transformers, and how do they compare to CNNs?
A. Vision transformers have three main components: an optimizer, dataset-specific parameters, and network depth. They can outperform CNNs, have fewer built-in inductive biases, and handle input image distortions more robustly thanks to their attention mechanisms.
Q. How does an image captioning model work?
A. An image captioning model uses an encoder-decoder structure: the encoder extracts features from the image and the decoder generates the text, typically built from transformer models with the help of various libraries.
Q. What are the benefits of image captioning?
A. Image captioning provides a textual representation of an image. The benefits include helping the visually impaired understand the context of an image, since screen readers can read the caption out loud.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.