I have written extensive articles and guides on how to build computer vision models using image data. Detecting objects in images, classifying those objects, generating labels from movie posters – there is so much we can do using computer vision and deep learning (a subset of machine learning).
This time, I decided to turn my attention to the less-heralded aspect of computer vision – videos! We are consuming video content at an unprecedented pace. I feel this area of computer vision holds a lot of potential for data scientists.
I was curious about applying the same computer vision algorithms to video data. The approach I used for building image classification models – was it generalizable?
Videos can be tricky for machines to handle. Their dynamic nature, as opposed to an image’s static one, can make it complex for a data scientist to build those models.
But don’t worry, it’s not that different from working with image data. In this article, we will build our very own video classification model in Python. This is a very hands-on tutorial, so fire up your Jupyter notebooks – this is going to be a very fun ride.
When you really break it down – how would you define videos?
We can say that a video is a collection of images arranged in a specific order. These images are also referred to as frames.
That’s why a video classification problem is not that different from an image classification problem. We take images for an image classification task, use feature extractors (like convolutional neural networks or CNNs) to extract features from images, and then classify that image based on these extracted features. Video classification involves just one extra step.
We first extract frames from the given video. We can then follow the same steps for an image classification task. This is the simplest way to deal with video data.
There are multiple other ways to deal with videos, and there is even a niche field of video analytics. I highly recommend going through the article below to understand how to deal with videos and extract frames in Python:
Deep Learning Tutorial to Calculate the Screen Time of Actors in any Video (with Python codes)
Also, we will be using CNNs to extract features from the frames of videos. Given their effectiveness and status as a state-of-the-art model in computer vision, CNNs are the chosen architecture for feature extraction in our video classification task. If you need a quick refresher on what CNNs are and how they work, this is where you should begin:
In our video classification task, we will be working with connectivity between frames, temporal information, and human actions. We will encode these aspects into tensors and assign class labels based on the extracted features. To enhance the model’s generalization capability, we will utilize transfer learning with a pre-trained architecture (VGG-16 in this article).
Excited to build a model that is able to classify videos into their respective categories? We will be working on the UCF101 – Action Recognition Data Set, which consists of 13,320 different video clips belonging to 101 distinct categories.
Let me summarize the steps that we will be following to build our video classification model:
Explore the video dataset and create the training and validation set. We will use the training set to train the model and the validation set to evaluate the trained model.
Extract frames from all the videos in the training as well as the validation set.
Preprocess these frames and then train a model using the frames in the training set. Evaluate the model using the frames present in the validation set.
Once we are satisfied with the performance on the validation set, use the trained model to classify new videos.
Let’s now start exploring the data!
You can download the dataset from the official UCF101 site. The dataset is in a .rar format so we first extract the videos from it. Create a new folder, let’s say ‘UCF’ (the code below reads the videos from this folder, so update the paths if you pick another name), and then use the following command to extract all the downloaded videos:
unrar e UCF101.rar UCF/
The official documentation of UCF101 states that:
“It is very important to keep the videos belonging to the same group separate in training and testing. Since the videos in a group are obtained from a single long video, sharing videos from the same group in training and testing sets would give high performance.”
So, we will split the dataset into the train and test sets as suggested in the official documentation. You can download the train/test split from here. Remember that you might require high computation power since we are dealing with a large dataset.
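To make the point about groups concrete, here is a minimal sketch that checks whether two lists of clips share a group. It assumes clip names follow the pattern v_<Class>_g<group>_c<clip>.avi; the helper names and the sample entries are hypothetical and only serve as an illustration.

```python
import re

# Assumption: UCF101 clip names look like "v_<Class>_g<group>_c<clip>.avi".
CLIP_PATTERN = re.compile(r"v_(?P<tag>[A-Za-z0-9]+)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi$")

def clip_group(path):
    """Return (tag, group) for an entry such as 'Tag/v_Tag_g01_c02.avi 1'."""
    name = path.split(' ')[0]  # drop a trailing class index, if present
    match = CLIP_PATTERN.search(name)
    if match is None:
        raise ValueError("unexpected clip name: " + path)
    return match.group("tag"), match.group("group")

def shared_groups(train_paths, test_paths):
    """Groups that appear in both lists; an empty set means no overlap."""
    train_groups = {clip_group(p) for p in train_paths}
    test_groups = {clip_group(p) for p in test_paths}
    return train_groups & test_groups

# Hypothetical sample entries, for illustration only.
sample_train = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1",
                "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c02.avi 1"]
sample_test = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi"]
print(shared_groups(sample_train, sample_test))  # set()
```

An empty result means no group contributes clips to both lists, which is the property the official split is meant to guarantee.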
We now have the videos in one folder and the train/test splitting file in another folder. Next, we will create the dataset. Open your Jupyter notebook and follow the code block below. We will first import the required libraries:
import cv2 # for capturing videos
import math # for mathematical operations
import matplotlib.pyplot as plt # for plotting the images
%matplotlib inline
import pandas as pd
from keras.preprocessing import image # for preprocessing the images
import numpy as np # for mathematical operations
from keras.utils import np_utils
from skimage.transform import resize # for resizing images
from sklearn.model_selection import train_test_split
from glob import glob
from tqdm import tqdm
We will now store the name of videos in a dataframe:
import pandas as pd
# open the .txt file which has the names of the training videos
with open("trainlist01.txt", "r") as f:
    temp = f.read()
videos = temp.split('\n')
# creating a dataframe having video names
train = pd.DataFrame()
train['video_name'] = videos
train = train[:-1]
print(train.head())
This is how the names of videos are given in the .txt file. It is not properly aligned and we will need to preprocess it. Before that, let’s create a similar dataframe for test videos as well:
# open the .txt file which has the names of the test videos
with open("testlist01.txt", "r") as f:
    temp = f.read()
videos = temp.split('\n')
# creating a dataframe having video names
test = pd.DataFrame()
test['video_name'] = videos
test = test[:-1]
test.head()
Next, we will add the tag of each video (for both training and test sets). Did you notice that the entire part before the ‘/’ in the video name represents the video’s tag? Hence, we will split the entire string on ‘/’ and select the tag for all the videos:
# creating tags for training videos
train_video_tag = []
for i in range(train.shape[0]):
    train_video_tag.append(train['video_name'][i].split('/')[0])
train['tag'] = train_video_tag
# creating tags for test videos
test_video_tag = []
for i in range(test.shape[0]):
    test_video_tag.append(test['video_name'][i].split('/')[0])
test['tag'] = test_video_tag
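As a small illustration of the string handling above, the sketch below splits a single entry into its tag and file name. The function name is hypothetical, and the assumed line format (a trailing class index on training lines, none on test lines) is an assumption about the split files rather than something shown in this article.

```python
def parse_list_entry(line):
    """Split one line of a split file into (tag, file_name).

    Assumption: a training line looks like 'Tag/v_Tag_g08_c01.avi 1',
    where the trailing number is a class index, and a test line has
    no trailing number.
    """
    relative_path = line.split(' ')[0]        # drop the class index, if any
    tag, file_name = relative_path.split('/')
    return tag, file_name

print(parse_list_entry("ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1"))
# ('ApplyEyeMakeup', 'v_ApplyEyeMakeup_g08_c01.avi')
```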
So what’s next? Now, we will extract the frames from the training videos which will be used to train the model. I will be storing all the frames in a folder named train_1.
So, first of all, make a new folder and rename it to ‘train_1’ and then follow the code given below to extract frames:
# storing the frames from training videos
for i in tqdm(range(train.shape[0])):
    count = 0
    videoFile = train['video_name'][i]
    # capturing the video from the given path
    cap = cv2.VideoCapture('UCF/' + videoFile.split(' ')[0].split('/')[1])
    frameRate = cap.get(5)  # frame rate (cv2.CAP_PROP_FPS)
    while cap.isOpened():
        frameId = cap.get(1)  # current frame number (cv2.CAP_PROP_POS_FRAMES)
        ret, frame = cap.read()
        if not ret:
            break
        # keeping one frame for every second of video
        if frameId % math.floor(frameRate) == 0:
            # storing the frames in a new folder named train_1
            filename = 'train_1/' + videoFile.split('/')[1].split(' ')[0] + "_frame%d.jpg" % count
            count += 1
            cv2.imwrite(filename, frame)
    cap.release()
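The frame-selection check in the loop above keeps one frame per second of video. The sketch below lists which frame numbers pass that check for a given clip; the function name and the sample clip (25 frames per second, 110 frames) are made up for illustration.

```python
import math

def sampled_frame_ids(total_frames, frame_rate):
    """Frame numbers kept by the check `frameId % math.floor(frameRate) == 0`."""
    step = math.floor(frame_rate)
    if step < 1:
        raise ValueError("frame rate must be at least 1 for this rule")
    return [frame_id for frame_id in range(total_frames) if frame_id % step == 0]

# Hypothetical clip: 25 frames per second, 110 frames (about 4.4 seconds).
print(sampled_frame_ids(110, 25.0))  # [0, 25, 50, 75, 100]
```

So a clip of a few seconds contributes only a handful of frames, which keeps the number of extracted images manageable.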
This will take some time as more than 9,500 videos are in the training set. Once the frames are extracted, we will save the names of these frames with their corresponding tag in a .csv file. Creating this file will help us read the frames, which we will do in the next section.
# getting the names of all the images
images = glob("train_1/*.jpg")
train_image = []
train_class = []
for i in tqdm(range(len(images))):
    # creating the image name
    train_image.append(images[i].split('/')[1])
    # creating the class of image
    train_class.append(images[i].split('/')[1].split('_')[1])
# storing the images and their class in a dataframe
train_data = pd.DataFrame()
train_data['image'] = train_image
train_data['class'] = train_class
# converting the dataframe into csv file
train_data.to_csv('UCF/train_new.csv',header=True, index=False)
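One caveat with splitting paths on ‘/’ is that glob can return backslash-separated paths on Windows. The sketch below is one possible way to read the class from a frame path regardless of the separator; the function name is hypothetical and the file-name layout is assumed to match the frames written earlier.

```python
import ntpath

def frame_class(frame_path):
    """Class tag from a frame path like 'train_1/v_Tag_g01_c01.avi_frame0.jpg'."""
    # ntpath.basename accepts both '/' and '\\' separators on any platform.
    file_name = ntpath.basename(frame_path)
    return file_name.split('_')[1]

print(frame_class("train_1/v_ApplyEyeMakeup_g08_c01.avi_frame0.jpg"))   # ApplyEyeMakeup
print(frame_class("train_1\\v_Archery_g10_c02.avi_frame3.jpg"))         # Archery
```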
So far, we have extracted frames from all the training videos and saved their names in a .csv file along with their corresponding tags. It’s time to train our model, which we will use to predict the video tags in the test set.
It’s finally time to train our video classification model! I’m sure this is the most anticipated section of the tutorial. I have divided this step into sub-steps for ease of understanding:
So, let’s get started with the first step, where we will read the frames that we extracted earlier. We will import the libraries first:
import keras
from keras.models import Sequential
from keras.applications.vgg16 import VGG16
from keras.layers import Dense, InputLayer, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, GlobalMaxPooling2D
from keras.preprocessing import image
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.model_selection import train_test_split
Remember, we created a .csv file that contains the names of each frame and their corresponding tag? Let’s read it as well:
train = pd.read_csv('UCF/train_new.csv')
train.head()
This is what the first five rows look like. We have the corresponding class or tag for each frame. Now, using this .csv file, we will read the frames that we extracted earlier and then store those frames as a NumPy array:
# creating an empty list
train_image = []
# for loop to read and store frames
for i in tqdm(range(train.shape[0])):
    # loading the image and keeping the target size as (224,224,3)
    img = image.load_img('train_1/'+train['image'][i], target_size=(224,224,3))
    # converting it to array
    img = image.img_to_array(img)
    # normalizing the pixel value
    img = img/255
    # appending the image to the train_image list
    train_image.append(img)
# converting the list to numpy array
X = np.array(train_image)
# shape of the array
X.shape
Output: (73844, 224, 224, 3)
We have 73,844 images each of size (224, 224, 3). Next, we will create the validation set.
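It is worth estimating how much memory an array of this shape needs before loading it. The sketch below does the arithmetic; it assumes the frames are stored as 32-bit floats (4 bytes per value), which is an assumption about the array's dtype rather than something printed above.

```python
def array_gib(shape, bytes_per_value):
    """Size of a dense array with the given shape, in GiB."""
    values = 1
    for dim in shape:
        values *= dim
    return values * bytes_per_value / 2**30

# Assumption: the frames end up as float32 (4 bytes per value) after
# img_to_array and the division by 255; float64 would double the figure.
print(round(array_gib((73844, 224, 224, 3), 4), 1))  # 41.4
```

Under that assumption the array alone takes roughly 41 GiB, which is why a machine with plenty of memory is needed for this step.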
To create the validation set, we need to ensure that each class’s distribution is similar in both training and validation sets. We can use the stratify parameter to do that:
# separating the target
y = train['class']
# creating the training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify = y)
Here, stratify = y (which is the class or tags of each frame) keeps a similar distribution of classes in both the training and the validation set.
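To see what stratification does, here is a plain-Python sketch that takes the same fraction from every class. It is not scikit-learn's implementation, only an illustration of the idea; the function name and the sample labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with about `test_size` of every class in val."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Hypothetical imbalanced labels: 50 frames of one class, 10 of another.
labels = ["Archery"] * 50 + ["Bowling"] * 10
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))  # 48 12
```

Both classes contribute 20% of their frames to the validation set, so the rarer class is not left out by chance.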
Remember – there are 101 categories in which a video can be classified. So, we will have to create 101 different columns in the target, one for each category. We will use the get_dummies() function for that:
# creating dummies of target variable for train and validation set
y_train = pd.get_dummies(y_train)
y_test = pd.get_dummies(y_test)
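Creating the dummy columns separately for each set works as long as every class appears in both sets, which the stratified split should ensure here. As a sketch of an alternative that makes the column order explicit, the snippet below encodes labels against one fixed, sorted list of classes. The function name and the three sample tags are hypothetical.

```python
import numpy as np

def one_hot(labels, classes):
    """One-hot encode `labels` using a fixed, sorted list of class names."""
    index_of = {name: position for position, name in enumerate(classes)}
    encoded = np.zeros((len(labels), len(classes)), dtype=np.float32)
    for row, label in enumerate(labels):
        encoded[row, index_of[label]] = 1.0
    return encoded

classes = sorted({"Bowling", "Archery", "Diving"})   # hypothetical subset of the 101 tags
y_val = one_hot(["Diving", "Archery"], classes)      # "Bowling" is absent from this sample
print(y_val.shape)  # (2, 3)
```

Even though one class is missing from the sample, the encoded array still has a column for it, so the training and validation targets stay aligned.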
Next step – define the architecture of our video classification model.
Since we do not have a very large dataset, creating a model from scratch might not work well. So, we will use a pre-trained model and take its learnings to solve our problem.
For this particular dataset, we will be using the VGG-16 pre-trained model. Let’s create a base model of the pre-trained model:
# creating the base model of pre-trained VGG16 model
base_model = VGG16(weights='imagenet', include_top=False)
This model was trained on a dataset that has 1,000 classes. We will fine-tune this model as per our requirements. include_top = False will remove the fully connected layers at the top of this model so that we can tune it as per our needs.
Now, we will extract features from this pre-trained model for our training and validation images:
# extracting features for training frames
X_train = base_model.predict(X_train)
X_train.shape
Output: (59075, 7, 7, 512)
We have 59,075 images in the training set and the shape has been changed to (7, 7, 512) since we have passed these images through the VGG16 architecture. Similarly, we will extract features for validation frames:
# extracting features for validation frames
X_test = base_model.predict(X_test)
X_test.shape
Output: (14769, 7, 7, 512)
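The (7, 7, 512) shape follows from the VGG16 architecture: it has five 2x2 max-pooling stages, each halving the height and width, and 512 channels in its last convolutional block. A short sketch of that arithmetic (the function name is just for illustration):

```python
def vgg16_feature_shape(height, width):
    """Output shape of VGG16 without its top layers for an image of this size.

    VGG16 has five 2x2 max-pooling stages (each halves height and width)
    and 512 channels in its last convolutional block.
    """
    pooling_stages = 5
    return (height // 2**pooling_stages, width // 2**pooling_stages, 512)

print(vgg16_feature_shape(224, 224))   # (7, 7, 512)
print(59075 + 14769)                   # 73844, the number of frames before the split
```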
There are 14,769 images in the validation set, and the shape of these images has also changed to (7, 7, 512). We will use a fully connected network now to fine-tune the model. This fully connected network takes input in a single dimension. So, we will reshape the images into a single dimension:
# reshaping the training as well as validation frames in single dimension
X_train = X_train.reshape(59075, 7*7*512)
X_test = X_test.reshape(14769, 7*7*512)
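If you work with a different number of frames, the row counts in the reshape call have to change too. A small sketch of one way to avoid hard-coding them, using a tiny stand-in array instead of the real features:

```python
import numpy as np

# Small stand-in for the real feature array (4 frames instead of 59,075).
features = np.arange(4 * 7 * 7 * 512, dtype=np.float32).reshape(4, 7, 7, 512)

# -1 lets NumPy infer the flattened length, so the row count is not hard-coded.
flat = features.reshape(features.shape[0], -1)
print(flat.shape)  # (4, 25088)
```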
It is always advisable to normalize the input values, i.e., keep them between 0 and 1. This helps the model to converge faster.
# normalizing the feature values using the maximum of the training set
max_value = X_train.max()
X_train = X_train/max_value
X_test = X_test/max_value
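Here is a toy illustration of this scaling step, with made-up numbers standing in for the extracted features. The scaling constant comes from the training set only and is then reused for the validation set.

```python
import numpy as np

# Hypothetical feature values standing in for the VGG16 outputs.
train_features = np.array([[0.0, 2.0], [4.0, 8.0]])
val_features = np.array([[1.0, 10.0]])

scale = train_features.max()          # computed on the training set only
train_scaled = train_features / scale
val_scaled = val_features / scale     # same constant, so values can exceed 1

print(train_scaled.max(), val_scaled.max())  # 1.0 1.25
```

Whatever constant is used here would also need to be applied to the features of any new video before prediction, so that the inputs stay on the same scale.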
Next, we will create the architecture of the model. We have to define the input shape for that. So, let’s check the shape of our images:
# shape of images
X_train.shape
Output: (59075, 25088)
The input shape will be 25,088. Let’s now create the architecture:
# defining the model architecture
model = Sequential()
model.add(Dense(1024, activation='relu', input_shape=(25088,)))
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(101, activation='softmax'))
We have multiple fully connected dense layers. I have added dropout layers as well so that the model will not overfit. The number of neurons in the final layer is equal to the number of classes that we have and hence the number of neurons here is 101.
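To get a feel for the size of this classifier, we can count its trainable parameters by hand: each dense layer has inputs x outputs weights plus one bias per output, and dropout layers add none. A short sketch (the function name is just for illustration):

```python
def dense_params(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Input of 25,088 features followed by the five Dense layers defined above.
sizes = [25088, 1024, 512, 256, 128, 101]
print(dense_params(sizes))  # 26393189
```

Almost all of the roughly 26.4 million parameters sit in the first layer (25,691,136), which is one reason the dropout layers are useful here.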
We will now train our model using the training frames and validate the model using validation frames. We will save the weights of the model so that we will not have to retrain the model again and again.
So, let’s define a callback to save the weights of the best model:
# defining a callback to save the weights of the best model
from keras.callbacks import ModelCheckpoint
mcp_save = ModelCheckpoint('weights.hdf5', save_best_only=True, monitor='val_loss', mode='min')
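The rule behind save_best_only=True with mode='min' can be pictured in plain Python: a checkpoint is written only when the monitored value improves on the best one seen so far. This is a sketch of the rule, not Keras's own code, and the loss values are made up.

```python
def best_epochs(val_losses):
    """Epochs (1-based) at which a 'save best only, minimise val_loss' rule saves."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:          # strictly better than anything seen so far
            best = loss
            saved.append(epoch)
    return saved

# Hypothetical validation losses for five epochs.
print(best_epochs([2.10, 1.60, 1.75, 1.40, 1.55]))  # [1, 2, 4]
```

In this example the file left on disk after training would hold the weights from epoch 4, not from the final epoch.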
We will decide the optimum model based on the validation loss. Note that the weights will be saved as weights.hdf5. You can rename the file if you wish. Before training the model, we have to compile it:
# compiling the model
model.compile(loss='categorical_crossentropy',optimizer='Adam',metrics=['accuracy'])
We are using the categorical_crossentropy as the loss function, and the optimizer is Adam. Let’s train the model:
# training the model
model.fit(X_train, y_train, epochs=200, validation_data=(X_test, y_test), callbacks=[mcp_save], batch_size=128)
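To build some intuition for the loss values printed during training, here is the categorical cross-entropy worked by hand for a single sample. The three-class example is hypothetical; our model has 101 outputs, but the calculation is the same.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Loss for one sample: minus the log of the probability given to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, predicted_probs) if t > 0)

# Hypothetical 3-class example; the tutorial's model has 101 outputs.
target = [0, 1, 0]
confident = [0.05, 0.90, 0.05]
unsure = [0.40, 0.30, 0.30]
print(round(categorical_crossentropy(target, confident), 3))  # 0.105
print(round(categorical_crossentropy(target, unsure), 3))     # 1.204
```

For reference, a model that spreads its probability evenly over 101 classes would score ln(101), about 4.615, so a loss near that value means the model is doing no better than guessing.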
I have trained the model for 200 epochs. You can use this link to download the weights I got after training the model.
We now have the weights we will use to make predictions for the new videos. So, in the next section, we will see how well this model performs the task of video classification!
Let’s open a new Jupyter Notebook to evaluate the model. The evaluation part can also be split into multiple steps to understand the process more clearly:
You’ll be familiar with the first step – importing the required libraries:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.preprocessing import image
import numpy as np
import pandas as pd
from tqdm import tqdm
from keras.applications.vgg16 import VGG16
import cv2
import math
import os
from glob import glob
from scipy import stats as s
Next, we will define the model architecture which will be similar to what we had while training the model:
base_model = VGG16(weights='imagenet', include_top=False)
This is the pre-trained model and we will fine-tune it next:
# defining the model architecture
model = Sequential()
model.add(Dense(1024, activation='relu', input_shape=(25088,)))
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(101, activation='softmax'))
Now that we have defined the architecture, we will load the trained weights which we stored as weights.hdf5:
# loading the trained weights
model.load_weights("weights.hdf5")
Compile the model as well:
# compiling the model
model.compile(loss='categorical_crossentropy',optimizer='Adam',metrics=['accuracy'])
Make sure that the loss function, optimizer, and the metrics are the same as we used while training the model.
If you’re new to the world of deep learning and computer vision, we have the perfect course for you to begin your journey:
Computer Vision using Deep Learning
You should have downloaded the train/test split files as per the official documentation of the UCF101 dataset. If not, download it from here. In the downloaded folder, there is a file named “testlist01.txt” which contains the list of test videos. We will make use of that to create the test data:
f = open("testlist01.txt", "r")
temp = f.read()
videos = temp.split('\n')
# creating the dataframe
test = pd.DataFrame()
test['video_name'] = videos
test = test[:-1]
test_videos = test['video_name']
test.head()
We now have the list of all the videos stored in a dataframe. To map the predicted categories with the actual categories, we will use the train_new.csv file:
# creating the tags
train = pd.read_csv('UCF/train_new.csv')
y = train['class']
y = pd.get_dummies(y)
Now, we will make predictions for the videos in the test set.
Let me summarize what we will be doing in this step before looking at the code. The below steps will help you understand the prediction part:
Let’s code these steps and generate predictions:
# creating two lists to store predicted and actual tags
predict = []
actual = []
# for loop to extract frames from each test video
for i in tqdm(range(test_videos.shape[0])):
    count = 0
    videoFile = test_videos[i]
    # capturing the video from the given path
    cap = cv2.VideoCapture('UCF/' + videoFile.split(' ')[0].split('/')[1])
    frameRate = cap.get(5)  # frame rate
    # removing all other files from the temp folder
    files = glob('temp/*')
    for f in files:
        os.remove(f)
    while cap.isOpened():
        frameId = cap.get(1)  # current frame number
        ret, frame = cap.read()
        if not ret:
            break
        if frameId % math.floor(frameRate) == 0:
            # storing the frames of this particular video in temp folder
            filename = 'temp/' + "_frame%d.jpg" % count
            count += 1
            cv2.imwrite(filename, frame)
    cap.release()
    # reading all the frames from temp folder
    images = glob("temp/*.jpg")
    prediction_images = []
    for j in range(len(images)):
        img = image.load_img(images[j], target_size=(224,224,3))
        img = image.img_to_array(img)
        img = img/255
        prediction_images.append(img)
    # converting all the frames for a test video into numpy array
    prediction_images = np.array(prediction_images)
    # extracting features using pre-trained model
    prediction_images = base_model.predict(prediction_images)
    # converting features in one dimensional array
    prediction_images = prediction_images.reshape(prediction_images.shape[0], 7*7*512)
    # predicting tags for each array (argmax over the class probabilities;
    # predict_classes is not available in recent Keras releases)
    prediction = np.argmax(model.predict(prediction_images), axis=1)
    # appending the mode of predictions in predict list to assign the tag to the video
    predict.append(y.columns.values[np.bincount(prediction).argmax()])
    # appending the actual tag of the video
    actual.append(videoFile.split('/')[1].split('_')[1])
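The last part of the loop above turns several per-frame predictions into a single tag for the video by taking the most frequent one. Here is that voting step in isolation, as a plain-Python sketch; the function name, the class names, and the frame predictions are all hypothetical.

```python
from collections import Counter

def video_tag(frame_predictions, class_names):
    """Most frequent per-frame class index, mapped back to its tag.

    Ties go to the smallest class index, matching the behaviour of a
    statistical mode that returns the smallest of the most common values.
    """
    counts = Counter(frame_predictions)
    top = max(counts.values())
    winner = min(index for index, n in counts.items() if n == top)
    return class_names[winner]

class_names = ["Archery", "Bowling", "Diving"]   # hypothetical, sorted like the dummy columns
print(video_tag([2, 0, 2, 1, 2], class_names))     # Diving
```

Three of the five frames vote for index 2, so the whole video is tagged as the third class even though two frames disagree.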
This step will take some time as there are around 3,800 videos in the test set. Once we have the predictions, we will calculate the performance of the model.
Time to evaluate our model and see what all the fuss was about.
We have the actual tags as well as the tags predicted by our model. We will make use of these to get the accuracy score. On the official documentation page of UCF101, the current accuracy is 43.90%. Can our model beat that? Let’s check!
# checking the accuracy of the predicted tags
from sklearn.metrics import accuracy_score
accuracy_score(predict, actual)*100
Output: 44.80570975416337
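The accuracy score here is simply the share of videos whose predicted tag matches the actual tag. A minimal sketch of the same calculation without scikit-learn, using made-up tags for four videos:

```python
def accuracy(predicted, actual):
    """Share of videos whose predicted tag equals the actual tag, in percent."""
    if len(predicted) != len(actual):
        raise ValueError("both lists must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical tags for four videos.
print(accuracy(["Archery", "Diving", "Bowling", "Diving"],
               ["Archery", "Diving", "Diving", "Diving"]))  # 75.0
```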
Great! Our model’s accuracy of 44.8% is comparable to what the official documentation states (43.9%).
You might be wondering why we are satisfied with an accuracy below 50%. Well, the main reason behind this low accuracy is the lack of data. We only have around 13,000 videos, and even those are of a very short duration.
In this article, we covered one of the most interesting applications of computer vision – video classification. We first understood how to deal with videos, then we extracted frames, trained a video classification model, and finally got a comparable accuracy of 44.8% on the test videos.
We can now try different approaches and aim to improve the performance of the model. One approach I can think of is to use 3D convolutions (3D CNNs), which can deal with videos directly.
Since videos are a sequence of frames, we can treat this as a sequence problem as well. So, there can be several more solutions to this and I suggest you explore them. Feel free to share your findings with the community.
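Both of these ideas keep the frames of a video together instead of scoring them one by one. As a starting point, the sketch below stacks frames into a single array with a time axis, which is the kind of input such models typically expect. The function name, the clip length, and the blank frames are hypothetical, and no model is built here.

```python
import numpy as np

def stack_clip(frames, clip_length):
    """Stack the first `clip_length` frames into one (T, H, W, C) array."""
    if len(frames) < clip_length:
        raise ValueError("not enough frames for one clip")
    return np.stack(frames[:clip_length], axis=0)

# Hypothetical input: six blank 224x224 RGB frames from one video.
frames = [np.zeros((224, 224, 3), dtype=np.float32) for _ in range(6)]
clip = stack_clip(frames, clip_length=4)
print(clip.shape)  # (4, 224, 224, 3)
```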
As always, if you have any suggestions or doubts related to this article, post them in the comments section below and I will be happy to answer them. And as I mentioned earlier, do check out the computer vision course if you’re new to this field.
A. Yes, Long Short-Term Memory (LSTM) networks are suitable for video classification, especially when capturing long-term dependencies and temporal sequences in videos is essential.
A. Video classification involves categorizing videos into predefined classes or labels. Deep learning models analyze temporal patterns to recognize actions, events, or objects within the video.
A. Video classification with deep learning poses challenges in capturing temporal dependencies, managing computational complexity, and handling intricate data annotation.
A. Best practices for building and training neural networks with TensorFlow or Keras include data preprocessing (normalize and augment), choosing an appropriate model architecture, incorporating regularization techniques (dropout, batch normalization), experimenting with optimizers and learning rates, and monitoring and tuning hyperparameters regularly.
A. RNNs capture temporal dependencies for video classification, while CNNs extract spatial features from individual frames. Hybrid models like 3D CNNs combine both aspects for comprehensive analysis.
A. Neural networks in computer vision innovate through efficient object detection (e.g., YOLO, Faster R-CNN), detailed semantic segmentation, and the generation of realistic images using models like GANs and VAEs.
Nice tutorial, just what I was looking for. Please can you share your notebook? Thanks.
What is the significance of this step? `if (frameId % math.floor(frameRate) == 0)`
not sure if this was answered, but I think it is making sure that the framerate matches with each image, so that it is accurately mapping all the images in each video
Frame rate = number of frames per second, e.g., 30 frames per second. For every second, we take one frame into account, so after that step we have one image for every second of video.
Hi Pulkit, First of all thanks for your useful articles. Can I get the link for the code of this project. It will be really helpful.
Hi Garima, All the codes are provided in this article itself.