Whenever an image appears in front of us, our brain can annotate or label it almost instantly. But what about computers? How can a machine process an image and label it with a relevant, accurate caption? This seemed nearly impossible a few years ago. With advances in computer vision and deep learning algorithms, and with the availability of relevant datasets and AI models, building a caption generator for images has become much easier. Caption generation is also growing as an industry, and data annotation has become a large business. In this guide, we will build one such annotation tool, capable of generating relevant captions for an image with the help of a dataset. Basic knowledge of two deep learning techniques, LSTM and CNN, is required. By the end of this article, you will know how to convert an image to a caption and how CNNs and LSTMs are combined for image captioning.
In this article, you will discover how an image caption generator and AI image description generator can transform your visual content. Learn about the benefits of using an AI image caption generator and explore various image captioning generators that help create engaging captions effortlessly. Unlock the potential of these tools to enhance your social media presence and improve audience engagement.
This article was published as a part of the Data Science Blogathon
Image caption generation is the process of recognizing the context of an image and annotating it with relevant captions using deep learning and computer vision. It involves labeling an image with English words, based on the datasets provided during model training. The CNN model used here, Xception, is pre-trained on the ImageNet dataset and is responsible for image feature extraction. The extracted features are fed to the LSTM model, which generates the image caption.
A CNN (convolutional neural network) is a specialized type of deep neural network used to recognize and classify images. It processes data represented as 2D matrices, such as images. CNNs can cope, to a degree, with scaled, translated, and rotated imagery. A CNN analyzes an image by scanning it from left to right and top to bottom and extracting relevant features. Finally, it combines these features to classify the image.
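The left-to-right, top-to-bottom scan described above can be sketched in a few lines of plain Python. This is only an illustration of one filter sliding over one image, not how Xception is implemented; the 4x4 image and the 2x2 edge filter are made-up values, and `convolve2d` is a hypothetical helper name.

```python
# Minimal sketch: slide a small filter over a 2D image and record one
# response per position (stride 1, no padding). As in CNN layers, the
# filter is not flipped, so strictly speaking this is cross-correlation.

def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for top in range(out_h):          # top to bottom
        row = []
        for left in range(out_w):     # left to right
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Hypothetical 4x4 image: dark left half (0), bright right half (1)
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical filter that responds where brightness rises from left to right
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
```

The feature map is non-zero only in the middle column, where the filter straddles the dark-to-bright edge. A real CNN learns many such filters and stacks several layers of them.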
LSTM (long short-term memory) is a type of RNN (recurrent neural network) that is well suited to sequence prediction problems. It is commonly used for next-word prediction, as when Google Search suggests the next word based on the text typed so far. As it processes its inputs, an LSTM carries forward relevant information and discards irrelevant information.
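To make "carry forward relevant information and discard the rest" concrete, here is a minimal sketch of a single LSTM step using the standard gate equations, reduced to scalar inputs and state so that it runs in plain Python. The weights are hand-picked, hypothetical values; a real LSTM layer learns weight matrices during training.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for a scalar input x and scalar state (h, c)."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate: how much old memory to keep
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate: how much new information to store
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate: how much memory to expose
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate memory content
    c = f * c_prev + i * g       # discard part of the old memory, add part of the new
    h = o * math.tanh(c)         # hidden state passed to the next step
    return h, c

# Hypothetical weights, chosen by hand for illustration only
weights = {
    "wf": 0.5, "uf": 0.1, "bf": 0.0,
    "wi": 0.6, "ui": 0.2, "bi": 0.0,
    "wo": 0.7, "uo": 0.3, "bo": 0.0,
    "wg": 0.8, "ug": 0.4, "bg": 0.0,
}

h, c = 0.0, 0.0
history = []
for x in [1.0, 0.5, -1.0]:
    h, c = lstm_step(x, h, c, weights)
    history.append((h, c))
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

The forget gate `f` scales the previous cell state and the input gate `i` scales the new candidate. This is the mechanism by which an LSTM keeps or drops information across a sequence.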
To build an image caption generator model, we merge a CNN with an LSTM. We can write this as:
Image Caption Generator Model (CNN-RNN model) = CNN + LSTM
The Flickr_8K dataset is used to train the image caption generator. The dataset is available as a direct download, which takes some time because of its large size (1GB). The Flickr_8k_text folder contains several text files; the most important is Flickr8k.token.txt, which stores all the image names with their captions. The 8091 images are stored inside the Flicker8k_Dataset folder, and the text files with the image captions are stored in the Flickr_8k_text folder.
We will use Jupyter notebooks to run our caption generator. Jupyter can be installed with pip along with the other libraries listed below. A good understanding of Python, deep learning, and NLP is required for the implementation; if you are not familiar with these topics, review them first.
Install the following libraries to begin the project:
pip install tensorflow
pip install keras
pip install pillow
pip install numpy
pip install tqdm
pip install jupyterlab
Let’s start by opening Jupyter and creating our Python 3 notebook. Name the notebook train_caption_generate.ipynb.
import numpy as np
from PIL import Image
import os
import string
from pickle import dump
from pickle import load
from keras.applications.xception import Xception #to get pre-trained model Xception
from keras.applications.xception import preprocess_input
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.preprocessing.text import Tokenizer #for text tokenization
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import add  # keras.layers.merge is not available in newer Keras releases
from keras.models import Model, load_model
from keras.layers import Input, Dense  # Keras layers to build our CNN and LSTM
from keras.layers import LSTM, Embedding, Dropout
from tqdm.notebook import tqdm  # to check loop progress
As we saw, all image captions are available in the Flickr8k.token.txt file in the Flickr_8k_text folder. If you look at this file carefully, you can work out its format: each line holds an image name, a caption number, and one caption, with the name and the caption separated by a tab. Every image has 5 captions, numbered from 0 to 4.
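As a rough illustration of that layout, the sketch below parses a few made-up lines in the same shape (image name, `#`, caption number, a tab, then the caption). The file names and captions are hypothetical, not taken from the dataset, and `parse_captions` is an illustrative helper rather than part of the project code.

```python
# Hypothetical sample in the token-file layout: name#number, tab, caption
sample = (
    "dog_park.jpg#0\tTwo dogs drink water .\n"
    "dog_park.jpg#1\tA pair of dogs at a water bowl .\n"
    "beach.jpg#0\tA child runs along the beach .\n"
)

def parse_captions(text):
    """Group captions by image name, dropping the '#number' suffix."""
    descriptions = {}
    for line in text.strip().split("\n"):
        image_id, caption = line.split("\t", 1)
        image_name = image_id.rsplit("#", 1)[0]
        descriptions.setdefault(image_name, []).append(caption)
    return descriptions

descriptions = parse_captions(sample)
for name, captions in descriptions.items():
    print(name, captions)
```

The result is a dictionary that maps each image name to its list of captions.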
Now we are going to define 5 functions for cleaning:
Code :
# Load the document file into memory
def load_doc(filename):
    # Open the file, read it fully, and close it automatically
    with open(filename, 'r') as file:
        text = file.read()
    return text
# get all images with their captions
def img_capt(filename):
    file = load_doc(filename)
    captions = file.split('\n')
    descriptions = {}
    for caption in captions[:-1]:
        img, caption = caption.split('\t')
        # img ends with '#' and the caption number; img[:-2] drops that suffix
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [caption]
        else:
            descriptions[img[:-2]].append(caption)
    return descriptions
# Data cleaning function: converts uppercase letters to lowercase and removes punctuation and words containing numbers
def txt_clean(captions):
    table = str.maketrans('', '', string.punctuation)
    for img, caps in captions.items():
        for i, img_caption in enumerate(caps):
            # str.replace returns a new string, so keep the result
            img_caption = img_caption.replace("-", " ")
            descp = img_caption.split()
            # uppercase to lowercase
            descp = [wrd.lower() for wrd in descp]
            # remove punctuation from each token
            descp = [wrd.translate(table) for wrd in descp]
            # remove hanging 's and a
            descp = [wrd for wrd in descp if len(wrd) > 1]
            # remove words containing numbers
            descp = [wrd for wrd in descp if wrd.isalpha()]
            # converting back to string
            img_caption = ' '.join(descp)
            captions[img][i] = img_caption
    return captions
def txt_vocab(descriptions):
    # To build vocab of all unique words
    vocab = set()
    for key in descriptions.keys():
        for d in descriptions[key]:
            vocab.update(d.split())
    return vocab

# To save all descriptions in one file
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc)
    data = "\n".join(lines)
    with open(filename, "w") as file:
        file.write(data)
# Set these paths according to the project folder on your system; for example, I created a folder named shikha on the D drive
dataset_text = r"D:\shikha\Project - Image Caption Generator\Flickr_8k_text"
dataset_images = r"D:\shikha\Project - Image Caption Generator\Flicker8k_Dataset"
# to prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
# loading the file that contains all data
# map them into descriptions dictionary
descriptions = img_capt(filename)
print("Length of descriptions =", len(descriptions))
# cleaning the descriptions
clean_descriptions = txt_clean(descriptions)
# to build vocabulary
vocabulary = txt_vocab(clean_descriptions)
print("Length of vocabulary = ", len(vocabulary))
# saving all descriptions in one file
save_descriptions(clean_descriptions, "descriptions.txt")
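To see what the cleaning steps do to a single caption, here is a small self-contained sketch that applies the same sequence of operations (hyphens to spaces, lowercase, strip punctuation, drop one-character and non-alphabetic tokens). The caption is a made-up example and `clean_caption` is a hypothetical helper used only for this illustration.

```python
import string

def clean_caption(caption):
    """Apply the article's cleaning steps to one caption string."""
    caption = caption.replace("-", " ")
    table = str.maketrans("", "", string.punctuation)
    words = [w.lower().translate(table) for w in caption.split()]
    # drop leftovers such as "a", "s", digits, and empty strings
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return " ".join(words)

cleaned = clean_caption("A black-and-white dog's 2 puppies run .")
print(cleaned)   # black and white dogs puppies run
```

Note that single-letter words such as "a" are removed along with the punctuation, which shrinks the vocabulary the model has to learn.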
Now we are going to use a pre-trained model called Xception, which has already been trained on a large dataset, to extract features from our images. Xception was trained on the ImageNet dataset to classify images into 1000 different classes. We can import this model directly from keras.applications. We need to make a few changes to the Xception model to integrate it with our model. Xception takes a 299*299*3 image as input. We remove its last classification layer and take the 2048-element feature vector as the output.
model = Xception(include_top=False, pooling='avg')
The extract_features() function extracts these features for all images. At the end, we save the features dictionary to a pickle file named “features.p”.
def extract_features(directory):
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for pic in tqdm(os.listdir(directory)):
        file = directory + "/" + pic
        image = Image.open(file)
        image = image.resize((299, 299))
        image = np.expand_dims(image, axis=0)
        # image = preprocess_input(image)
        # scale pixel values from [0, 255] to [-1, 1]
        image = image / 127.5
        image = image - 1.0
        feature = model.predict(image)
        features[pic] = feature
    return features

# 2048 feature vector
features = extract_features(dataset_images)
dump(features, open("features.p", "wb"))
# to directly load the features from the pickle file.
features = load(open("features.p", "rb"))
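The two arithmetic lines inside extract_features (divide by 127.5, then subtract 1.0) rescale 8-bit pixel values from the range 0 to 255 into the range -1 to 1. To my understanding this is the same scaling the commented-out preprocess_input call would apply for Xception, though that is worth checking against your Keras version. The sketch below shows the arithmetic on a made-up two-pixel image in plain Python; `scale_image` is a hypothetical helper.

```python
def scale_image(image):
    """Map 8-bit pixel values from [0, 255] to [-1.0, 1.0]."""
    return [[[value / 127.5 - 1.0 for value in pixel] for pixel in row] for row in image]

# Hypothetical 1x2 RGB image: one black pixel and one white pixel
tiny_image = [[[0, 0, 0], [255, 255, 255]]]

scaled = scale_image(tiny_image)
batch = [scaled]   # a leading batch axis, similar to what np.expand_dims(image, axis=0) adds

print(scaled)      # [[[-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]]]
```

Black maps to -1.0, white maps to 1.0, and mid-gray (127.5) would map to 0.0.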
A file named “Flickr_8k.trainImages.txt” is present in our Flickr_8k_text folder. This file contains the list of 6000 image names used for training.
Functions required to load the training datasets:
Code :
# load the data
def load_photos(filename):
    file = load_doc(filename)
    photos = file.split("\n")[:-1]
    return photos

def load_clean_descriptions(filename, photos):
    # loading clean_descriptions
    file = load_doc(filename)
    descriptions = {}
    for line in file.split("\n"):
        words = line.split()
        if len(words) < 1:
            continue
        image, image_caption = words[0], words[1:]
        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            # wrap each caption in the start/end markers that generate_desc relies on
            desc = 'start ' + " ".join(image_caption) + ' end'
            descriptions[image].append(desc)
    return descriptions

def load_features(photos):
    # loading all features
    all_features = load(open("features.p", "rb"))
    # selecting only needed features
    features = {k: all_features[k] for k in photos}
    return features
filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"
#train = loading_data(filename)
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)
Machines are not familiar with complex English words, so to process the data, the model needs a simple numerical representation. That is why we map every word of the vocabulary to a unique index value. The Keras library has a built-in Tokenizer class that creates tokens from our vocabulary. We save the tokenizer to a pickle file named “tokenizer.p”.
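The word-to-index mapping can be sketched without Keras. The version below is a simplified stand-in for the Tokenizer, not its actual implementation: it gives the most frequent word index 1 and leaves index 0 free for padding, which is why the vocabulary size is the number of words plus one. Ties are broken alphabetically here, which is an assumption of this sketch, and the captions are made-up examples.

```python
from collections import Counter

def build_word_index(captions):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for caption in captions for word in caption.split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: index for index, word in enumerate(ordered, start=1)}

def encode(caption, word_index):
    """Turn a caption into a list of integers, skipping unknown words."""
    return [word_index[w] for w in caption.split() if w in word_index]

# Hypothetical cleaned captions with start/end markers
captions = [
    "start two dogs drink water end",
    "start two dogs run end",
]

word_index = build_word_index(captions)
vocab_size = len(word_index) + 1          # +1 because index 0 is reserved for padding
encoded = encode(captions[0], word_index)

print(word_index)
print(vocab_size, encoded)
```

With these two captions the four words that appear twice receive indices 1 to 4, and the three words that appear once receive indices 5 to 7.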
Code:
# convert dictionary to clear list of descriptions
def dict_to_list(descriptions):
    all_desc = []
    for key in descriptions.keys():
        all_desc.extend(descriptions[key])
    return all_desc

# creating tokenizer class
# this will vectorise text corpus
# each integer will represent token in dictionary
from keras.preprocessing.text import Tokenizer

def create_tokenizer(descriptions):
    desc_list = dict_to_list(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(desc_list)
    return tokenizer

# give each word an index, and store that into tokenizer.p pickle file
tokenizer = create_tokenizer(train_descriptions)
dump(tokenizer, open('tokenizer.p', 'wb'))
vocab_size = len(tokenizer.word_index) + 1
vocab_size  # The size of our vocabulary is 7577 words.

# calculate maximum length of descriptions to decide the model structure parameters.
def max_length(descriptions):
    desc_list = dict_to_list(descriptions)
    return max(len(d.split()) for d in desc_list)

# note: this rebinds the name max_length from the function to its integer result
max_length = max_length(descriptions)
max_length  # max_length of description is 32
To train the model as a supervised learning task, we need to feed it input and output sequences. Our training set contains 6000 images, each with a 2048-length feature vector and captions represented as numbers. It is not possible to hold this much data in memory at once, so we use a generator that yields batches.
For example, [x1, x2] is the input to our model and y is the output, where x1 is the 2048-element feature vector of the image, x2 is the input text sequence, and y is the next word that the model should predict.
x1(feature vector) | x2(Text sequence) | y(word to predict) |
feature | start, | two |
feature | start, two | dogs |
feature | start, two, dogs | drink |
feature | start, two, dogs, drink | water |
feature | start, two, dogs, drink, water | end |
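The table above can be reproduced with a short loop. This is a minimal sketch of how one encoded caption is split into (x2, y) pairs, with each input left-padded with zeros to a fixed length as pad_sequences does by default. The word indices and the `max_length` of 8 are made-up values for illustration, and `make_pairs` is a hypothetical helper; the image feature x1 is simply repeated for every pair.

```python
def make_pairs(seq, max_length):
    """Split one encoded caption into (padded input sequence, next word) pairs."""
    pairs = []
    for i in range(1, len(seq)):
        in_seq, out_word = seq[:i], seq[i]
        padded = [0] * (max_length - len(in_seq)) + in_seq   # left-pad with zeros
        pairs.append((padded, out_word))
    return pairs

# Hypothetical word indices for: start two dogs drink water end
seq = [1, 2, 3, 4, 5, 6]
pairs = make_pairs(seq, max_length=8)

for x2, y in pairs:
    print(x2, "->", y)
```

A caption of n tokens produces n - 1 training pairs, so the number of rows yielded per image is the sum of (length - 1) over its captions. In the article's code, y is additionally one-hot encoded with to_categorical.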
# data generator, used by model.fit_generator()
def data_generator(descriptions, features, tokenizer, max_length):
    while 1:
        for key, description_list in descriptions.items():
            # retrieve photo features
            feature = features[key][0]
            inp_image, inp_seq, op_word = create_sequences(tokenizer, max_length, description_list, feature)
            # yield a tuple of (inputs, targets)
            yield ([inp_image, inp_seq], op_word)

def create_sequences(tokenizer, max_length, desc_list, feature):
    x_1, x_2, y = list(), list(), list()
    # move through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # divide one sequence into various X,y pairs
        for i in range(1, len(seq)):
            # divide into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            x_1.append(feature)
            x_2.append(in_seq)
            y.append(out_seq)
    return np.array(x_1), np.array(x_2), np.array(y)

# To check the shape of the input and output for your model
[a, b], c = next(data_generator(train_descriptions, features, tokenizer, max_length))
a.shape, b.shape, c.shape
# ((47, 2048), (47, 32), (47, 7577))
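We use the Keras Model class from the Functional API to define the structure of the model. It has three parts: a feature extractor that compresses the 2048-element image feature vector to 256 nodes, an LSTM sequence model that processes the text input, and a decoder that merges the two and predicts the next word.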
from keras.utils import plot_model

# define the captioning model
def define_model(vocab_size, max_length):
    # features from the CNN model compressed from 2048 to 256 nodes
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # LSTM sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Merging both models
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # merge it [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
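As a sanity check on the architecture, the number of trainable parameters can be worked out by hand from the layer sizes in define_model. The sketch below uses the standard parameter-count formulas for Dense, Embedding, and LSTM layers with biases, and the vocabulary size of 7577 reported earlier. The helper names are hypothetical, and the total is expected to agree with model.summary() only if your Keras version uses these default layer configurations.

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out                  # weights + biases

def embedding_params(vocab_size, dim):
    return vocab_size * dim                      # one vector per word index

def lstm_params(n_in, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((n_in + units) * units + units)

vocab_size = 7577   # from the tokenizer step in the article

counts = {
    "image Dense(256)": dense_params(2048, 256),
    "Embedding(vocab_size, 256)": embedding_params(vocab_size, 256),
    "LSTM(256)": lstm_params(256, 256),
    "decoder Dense(256)": dense_params(256, 256),
    "output Dense(vocab_size)": dense_params(256, vocab_size),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:30s} {n:>10,}")
print(f"{'total':30s} {total:>10,}")   # 5,002,649
```

Most of the parameters sit in the Embedding layer and the output layer, both of which scale with the vocabulary size. The Dropout and add layers have no parameters.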
We train the model on the 6000 training images by generating the input and output sequences in batches and fitting them to the model with model.fit_generator(). Finally, we save the model to our models folder.
# train our model
print('Dataset: ', len(train_imgs))
print('Descriptions: train=', len(train_descriptions))
print('Photos: train=', len(train_features))
print('Vocabulary Size:', vocab_size)
print('Description Length: ', max_length)
model = define_model(vocab_size, max_length)
epochs = 10
steps = len(train_descriptions)
# creating a directory named models to save our models (no error if it already exists)
os.makedirs("models", exist_ok=True)
for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
    # fit_generator is deprecated in TensorFlow 2.1 and later; there, model.fit accepts the generator with the same arguments
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save("models/model_" + str(i) + ".h5")
After successful model training, our task is to test the model by feeding it test images. Let’s create a Python file named test_caption.py to load the model and generate predictions.
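import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import argparse
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
from keras.applications.xception import Xception

ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True, help="Image Path")
args = vars(ap.parse_args())
img_path = args['image']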
def extract_features(filename, model):
    try:
        image = Image.open(filename)
    except OSError:
        # stop here: without an image there is nothing to caption
        raise SystemExit("ERROR: Can't open image! Ensure that image path and extension is correct")
    image = image.resize((299, 299))
    image = np.array(image)
    # for 4 channels images, we need to convert them into 3 channels
    if image.shape[2] == 4:
        image = image[..., :3]
    image = np.expand_dims(image, axis=0)
    # scale pixel values from [0, 255] to [-1, 1]
    image = image / 127.5
    image = image - 1.0
    feature = model.predict(image)
    return feature
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_desc(model, tokenizer, photo, max_length):
    in_text = 'start'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        pred = model.predict([photo, sequence], verbose=0)
        pred = np.argmax(pred)
        word = word_for_id(pred, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'end':
            break
    return in_text
max_length = 32
tokenizer = load(open("tokenizer.p", "rb"))
model = load_model('models/model_9.h5')
xception_model = Xception(include_top=False, pooling="avg")
photo = extract_features(img_path, xception_model)
img = Image.open(img_path)
description = generate_desc(model, tokenizer, photo, max_length)
print("\n\n")
print(description)
plt.imshow(img)
plt.show()  # needed to display the image when run as a script
Output: the script prints the generated caption and displays the input image.
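The loop inside generate_desc is a form of greedy decoding: at each step it asks the model for the most likely next word, appends it, and stops at the end marker or at the maximum length. The sketch below isolates that loop so it can be tried without a trained model. The next-word table is a made-up stand-in for model.predict, and `greedy_decode` and `toy_predict_next` are hypothetical names.

```python
def greedy_decode(predict_next, max_length):
    """Build a caption word by word, always taking the most likely next word."""
    words = ["start"]
    for _ in range(max_length):
        scores = predict_next(words)           # dict: candidate word -> probability
        best = max(scores, key=scores.get)     # greedy choice
        words.append(best)
        if best == "end":
            break
    return " ".join(words)

# Hypothetical stand-in for the trained model: a fixed next-word table
TABLE = {
    "start": {"two": 0.7, "a": 0.3},
    "two": {"dogs": 0.8, "cats": 0.2},
    "dogs": {"drink": 0.6, "run": 0.4},
    "drink": {"water": 0.9, "end": 0.1},
    "water": {"end": 0.95, "two": 0.05},
}

def toy_predict_next(words):
    # a real model would look at the image features and the whole sequence so far
    return TABLE[words[-1]]

caption = greedy_decode(toy_predict_next, max_length=32)
print(caption)   # start two dogs drink water end
```

Greedy decoding is simple but can miss better captions, because an early choice is never revisited. Beam search is a common alternative, though it is not used in this article.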
In this guide, we built a deep learning model with the help of a CNN and an LSTM. We used a small dataset of about 8000 images to train our model; production-level models use larger datasets of more than 100,000 images for better accuracy. In general, the larger the dataset, the higher the accuracy. So, if you want to build a more accurate caption generator, you can try this model with a larger dataset. We hope this article has given you a clear understanding of how to convert an image to a caption.
We hope you enjoyed this article on using deep learning to automatically write captions for images. It showed how to create your own image caption project in Python and how to build a generator that can describe photos in words. This is a fun project for practicing machine learning and natural language processing skills. Give it a try and see what kind of descriptions your image caption generator can come up with!
Image caption generators are a fascinating application of deep learning that bridge the gap between computer vision and natural language processing (NLP). Here’s a breakdown of how they work:
CNNs and LSTMs
The Process:
Applications:
Image caption generators have a range of applications, including:
A. An image caption generator is a system that generates textual descriptions or captions for images automatically. It combines computer vision techniques to understand the visual content of an image and natural language processing (NLP) techniques to generate descriptive captions.
A. In a project description, captioning refers to generating textual descriptions or captions for images using machine learning and NLP techniques. It involves analyzing the visual features of an image and generating a coherent and relevant caption that describes the image’s content.
A. The dataset for image caption generation typically consists of pairs of images and corresponding captions. These datasets are manually annotated, where human annotators provide descriptive captions for a given set of images. Commonly used datasets include MSCOCO, Flickr8K, and Flickr30K.
A. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture commonly used in image captioning. LSTM networks can capture long-term dependencies in sequential data, making them suitable for generating coherent and contextually relevant captions by modeling the sequential nature of language.