Video recognition is a cornerstone of modern computer vision, enabling machines to understand and interpret the visual content of videos. With the rapid evolution of convolutional neural networks (CNNs) and transformers, significant strides have been made in the accuracy and efficiency of video recognition systems. However, traditional approaches are often constrained by closed-set learning paradigms, which limit their ability to adapt to new and emerging categories in real-world scenarios. X-CLIP is a model designed to address exactly these limitations.
In this article, we take a closer look at X-CLIP: its core architecture, the mechanisms behind its strong performance, and its zero/few-shot transfer capabilities, which are reshaping AI-powered video analysis.
By the end, you will have a clear picture of what X-CLIP can do and why it matters for the future of video recognition.
So, how does X-CLIP achieve this remarkable feat?
X-CLIP is built on contrastive language-image pretraining, a technique that learns jointly from natural language and visual input rather than from class labels alone.
It offers a more holistic approach than conventional video recognition methods and achieves strong accuracy on standard video analysis tasks. What really sets X-CLIP apart, however, is its ability to adapt to novel and diverse categories of videos, even when faced with little or no additional training data.
Unlike traditional video recognition methods that rely on supervised feature embeddings with one-hot labels, X-CLIP leverages text as supervision, providing richer semantic information. The approach involves training a video encoder and a text encoder simultaneously to align video and text representations effectively.
Rather than starting from scratch with a new video-text model, X-CLIP builds upon existing language-image models, enhancing them with video temporal modeling and video-adaptive textual prompts. This strategy maximizes the utilization of large-scale pretrained models while seamlessly transferring their robust generalizability from images to videos.
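To make the dual-encoder idea concrete, here is a minimal sketch (not X-CLIP's actual implementation) of how a video embedding can be scored against candidate text embeddings: both are mapped into a shared space, L2-normalized, and ranked by cosine similarity. The helper name, the temperature value, and the embedding dimension of 512 are illustrative assumptions.
import torch
import torch.nn.functional as F
# Illustrative only: stand-ins for the embeddings produced by X-CLIP's video and text encoders.
def score_video_against_texts(video_emb: torch.Tensor, text_embs: torch.Tensor,
                              temperature: float = 0.01) -> torch.Tensor:
    """Cosine-similarity scores between one video embedding (D,) and N text embeddings (N, D)."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = text_embs @ video_emb / temperature   # (N,)
    return logits.softmax(dim=-1)                  # probabilities over the N candidate labels
# Toy usage with random embeddings (real embeddings come from the trained encoders)
probs = score_video_against_texts(torch.randn(512), torch.randn(3, 512))
print(probs)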
The core of X-CLIP’s video encoder consists of two primary components: a cross-frame communication transformer and a multi-frame integration transformer. These work in tandem to capture global spatial and temporal information from the video frames, enabling efficient representation learning.
The cross-frame communication transformer facilitates information exchange between frames, allowing for the abstraction and communication of visual information across the entire video. This is achieved through a sophisticated attention mechanism that models spatio-temporal dependencies effectively.
X-CLIP’s text encoder is augmented with a video-specific prompting scheme, enhancing text representation with contextual information from videos. Unlike manual prompt designs, which often fail to improve performance, X-CLIP’s learnable prompting mechanism dynamically generates textual representations tailored to each video’s content.
By leveraging the synergy between video content and text embeddings, it enhances the discriminative power of textual prompts, enabling more accurate and context-aware video recognition.
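As a rough illustration of this idea, the sketch below conditions each text embedding on frame-level video features through a cross-attention layer and adds the result back with a small residual weight. The class name, dimensions, and residual weight alpha are assumptions made for illustration; the actual X-CLIP prompt generator differs in its details.
import torch
import torch.nn as nn
class VideoSpecificPrompt(nn.Module):
    """Simplified sketch: enrich text embeddings with video context via cross-attention."""
    def __init__(self, dim: int = 512, num_heads: int = 8, alpha: float = 0.1):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = alpha  # small residual weight so the text semantics still dominate
    def forward(self, text_embs: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # text_embs: (num_labels, dim), frame_feats: (num_frames, dim)
        q = text_embs.unsqueeze(0)          # (1, num_labels, dim)
        kv = frame_feats.unsqueeze(0)       # (1, num_frames, dim)
        video_context, _ = self.cross_attn(q, kv, kv)
        return text_embs + self.alpha * video_context.squeeze(0)
# Toy usage: 3 candidate labels, 32 frame features
prompted = VideoSpecificPrompt()(torch.randn(3, 512), torch.randn(32, 512))
print(prompted.shape)  # torch.Size([3, 512])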
Now, let’s move on to how to use the X-CLIP Model.
We first install 🤗 Transformers, decord, and pytube.
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q pytube decord
Here you can provide any YouTube video you like! Just paste its URL 🙂 In my case, I’m using a YouTube video of people playing football.
from pytube import YouTube
# Any public YouTube URL works here
youtube_url = 'https://youtu.be/VMj-3S1tku0'
yt = YouTube(youtube_url)
# Keep only the mp4 streams and download the first one
streams = yt.streams.filter(file_extension='mp4')
print(streams)
print(len(streams))
file_path = streams[0].download()
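YouTube downloads via pytube occasionally break when YouTube changes its page layout. If that happens, one option is to pull a short sample clip from the Hugging Face Hub instead; the repo_id and filename below are the sample clip commonly used in Transformers examples, and you can substitute any local .mp4 path if it is unavailable.
from huggingface_hub import hf_hub_download
# Fallback: a short sample clip hosted on the Hub (assumed available; swap in any local .mp4 otherwise)
file_path = hf_hub_download(
    repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
)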
The X-CLIP model we’ll use expects 32 frames for a given video. Let’s sample them:
from decord import VideoReader, cpu
import torch
import numpy as np
from huggingface_hub import hf_hub_download  # optional: only needed for the Hub fallback shown earlier
np.random.seed(0)
def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    # Pick a random window of clip_len * frame_sample_rate consecutive frames,
    # then return clip_len evenly spaced indices from that window.
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = np.random.randint(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    return indices
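To see what this helper produces: with clip_len=32 and frame_sample_rate=4, it chooses a random 128-frame window inside the video and returns 32 evenly spaced frame indices from it. A quick check on a hypothetical 1,000-frame video:
# Example: 32 indices spread over a random 128-frame window of a 1,000-frame video
example_indices = sample_frame_indices(clip_len=32, frame_sample_rate=4, seg_len=1000)
print(example_indices.shape)        # (32,)
print(example_indices[:5], "...")   # monotonically increasing frame numbers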
videoreader = VideoReader(file_path, num_threads=1, ctx=cpu(0))
# sample 32 frames
videoreader.seek(0)
indices = sample_frame_indices(clip_len=32, frame_sample_rate=4, seg_len=len(videoreader))
video = videoreader.get_batch(indices).asnumpy()
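Before moving on, it is worth confirming that the sampled clip has the expected shape; the exact height and width depend on the stream you downloaded.
print(video.shape)  # e.g. (32, height, width, 3): 32 RGB frames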
Let’s visualize the first frame!
from PIL import Image
Image.fromarray(video[0])
import matplotlib.pyplot as plt
# Visualize 32 frames
fig, axs = plt.subplots(4, 8, figsize=(16, 8))
fig.suptitle('Sampled Frames from Video')
axs = axs.flatten()
for i in range(32):
    axs[i].imshow(video[i])
    axs[i].axis('off')
plt.tight_layout()
plt.savefig('Frames')
plt.show()
Let’s instantiate the XCLIP model, along with its processor.
from transformers import XCLIPProcessor, XCLIPModel
model_name = "microsoft/xclip-base-patch16-zero-shot"
processor = XCLIPProcessor.from_pretrained(model_name)
model = XCLIPModel.from_pretrained(model_name)
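A single clip runs fine on CPU, but if you have a GPU available you can optionally move the model to it. If you do, remember to also send the processor's outputs to the same device before the forward pass.
# Optional: run on GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# If you do this, also call inputs = inputs.to(device) after building the inputs below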
Usage of X-CLIP is identical to CLIP: you can feed it a list of candidate texts, and the model determines which ones go best with the video.
import torch
input_text = ["programming course", "eating spaghetti", "playing football"]
inputs = processor(text=input_text, videos=list(video), return_tensors="pt", padding=True)
# forward pass
with torch.no_grad():
    outputs = model(**inputs)
# one probability per candidate text
probs = outputs.logits_per_video.softmax(dim=1)
print(probs)
max_prob = probs.argmax(dim=1).item()
print(f'Video is about: {input_text[max_prob]}')
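Beyond this classification-style scoring, the same model can produce standalone embeddings for retrieval. The snippet below uses the get_video_features and get_text_features methods exposed by the Transformers X-CLIP model (see its documentation) and compares the embeddings with cosine similarity; treat it as an illustrative extra step rather than part of the original walkthrough.
# Extract standalone embeddings for retrieval-style use cases
with torch.no_grad():
    video_features = model.get_video_features(pixel_values=inputs["pixel_values"])
    text_features = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
# Cosine similarity between the clip and each candidate text
video_features = video_features / video_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
print(video_features @ text_features.T)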
In conclusion, X-CLIP represents a groundbreaking advancement in video recognition, leveraging cross-modality pretraining to achieve remarkable accuracy and adaptability. By combining language understanding with visual perception, X-CLIP opens up new possibilities in understanding and interpreting video content. Its innovative architecture, seamless integration of temporal cues and textual prompts, and robust performance in zero/few-shot scenarios make it a game-changer in the field of AI-powered video analysis.
Q1. What is the X-CLIP model?
A. X-CLIP is a model that integrates language understanding and visual perception for video recognition tasks.
Q2. How does X-CLIP improve video recognition?
A. X-CLIP leverages cross-modality pretraining, innovative architectures, and video-specific prompting to enhance accuracy and adaptability.
Q3. Can X-CLIP recognize categories it has never seen during training?
A. Yes, X-CLIP demonstrates strong performance in zero-shot scenarios, adapting to unseen categories with minimal training data.