All About ChatGPT-4 Vision’s Image and Video Capabilities

Badrinarayan M Last Updated : 14 Jul, 2024

Introduction

ChatGPT-4 Vision, or GPT-4V, adds visual capabilities to the GPT-4 language model, marking a notable step forward in artificial intelligence. With this addition, the model can process and interpret visual content alongside text, which makes it a flexible tool for many uses. This article covers the primary functions associated with ChatGPT-4 Vision, namely image analysis, video analysis, and image generation, along with examples of how these features can be used in different contexts.

Overview

ChatGPT-4 Vision integrates visual capabilities with GPT-4, enabling image and video processing alongside text generation.
Image analysis by ChatGPT-4 Vision includes object detection, classification, and scene understanding, offering accurate and efficient insights.
Key features include object detection for automated tasks, image classification for various industries, and scene understanding for advanced applications.
ChatGPT-4 Vision can generate images from text descriptions, providing innovative solutions for design, content creation, and more.
Video analysis capabilities of ChatGPT-4 Vision include action recognition, motion detection, and event identification, enhancing various fields like security and sports analytics.
Practical applications span healthcare diagnostics, retail visual search, security surveillance, and interactive learning, demonstrating ChatGPT-4 Vision’s versatility.

Image Analysis

Image analysis is the extraction of useful information from images. It covers tasks such as object detection, image classification, and scene understanding, and ChatGPT-4 Vision can carry out these tasks efficiently and accurately.

Key Features

Object detection: Finding and identifying items in an image. Its uses include inventory management, driverless cars, and automated surveillance.
Image classification: Sorting images into predetermined categories. This helps with disease identification in medical imaging, content moderation on social media, and product classification in retail.
Scene understanding: Examining the context of an image and the relationships between its elements. This benefits applications in robotics, augmented reality, and virtual assistance.

Example Use Case

In a smart home security system, ChatGPT-4 Vision can examine security camera footage to find unusual activity or intruders. It can categorize what it sees, such as people, pets, and cars, and trigger alarms according to predefined security rules.

Implementation of Image Analysis

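First, let's install the necessary dependencies. Pillow and OpenCV are included because later examples import PIL and cv2.

!pip install openai
!pip install requests
!pip install pillow opencv-python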

Importing necessary libraries

import openai
import requests
import base64
from openai import OpenAI
from PIL import Image
from io import BytesIO
from IPython.display import display

Image Analysis with a URL

client = OpenAI(api_key='Enter your Key')
response = client.chat.completions.create(
 model="gpt-4o",
 messages=[
   {
     "role": "user",
     "content": [
       {"type": "text", "text": "Describe me this image"},
       {
         "type": "image_url",
         "image_url": {
           "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
         },
       },
     ],
   }
 ],
 max_tokens=300,
)

print(response.choices[0].message.content)

In the above code, we pass the URL of an image together with a prompt asking the model to describe it. Below is the image that we are passing.
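The snippet above passes the API key as a placeholder string. A common alternative, sketched here, is to read the key from an environment variable so that it never ends up in a notebook or repository. `OPENAI_API_KEY` is the variable the official `openai` client looks up by default when no `api_key` argument is given; the helper name `load_api_key` is hypothetical.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    Raises a clear error instead of sending an empty key to the API.
    `load_api_key` is an illustrative helper, not part of the openai package.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With this helper, the client could be created as `client = OpenAI(api_key=load_api_key())`.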

Output

Image Analysis with Local Images

api_key = "Enter your key"
def encode_image(image_path):
 with open(image_path, "rb") as image_file:
   return base64.b64encode(image_file.read()).decode('utf-8')


# Path to your image
image_path = "/content/cat.jpeg"


# Getting the base64 string
base64_image = encode_image(image_path)


headers = {
 "Content-Type": "application/json",
 "Authorization": f"Bearer {api_key}"
}


payload = {
 "model": "gpt-4o",
 "messages": [
   {
     "role": "user",
     "content": [
       {
         "type": "text",
         "text": "Describe me this image"
       },
       {
         "type": "image_url",
         "image_url": {
           "url": f"data:image/jpeg;base64,{base64_image}"
         }
       }
     ]
   }
 ],
 "max_tokens": 300
}


response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

In the above code, we send a local image of a cat, encoded as a base64 string, and prompt the model to describe it.
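The data URL in the payload above is written for JPEG files specifically. If the local image may be a PNG or another format, the MIME type can be inferred from the file extension instead. The sketch below uses only the standard library; `image_to_data_url` is a hypothetical helper name, and it assumes the file extension reflects the actual image format.

```python
import base64
import mimetypes


def image_to_data_url(image_path: str) -> str:
    """Encode a local image file as a base64 data URL.

    The MIME type is guessed from the file extension (an assumption:
    the extension is trusted, the file contents are not inspected).
    """
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image type from {image_path!r}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used directly as the `url` value of an `image_url` entry.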

Output

print(response.json()["choices"][0]["message"]["content"])
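The print statement above assumes the request succeeded. When a request fails, for example because of an invalid key, the API returns a JSON body with an `error` object and no `choices`, so the lookup raises a `KeyError`. A minimal sketch of a more defensive reader follows; `extract_message_text` is a hypothetical helper name.

```python
def extract_message_text(response_json: dict) -> str:
    """Return the assistant's text, or raise if the API reported an error.

    Illustrative helper: it only looks at the 'error' and 'choices' keys
    of an already-parsed chat completions response.
    """
    if "error" in response_json:
        message = response_json["error"].get("message", "unknown error")
        raise RuntimeError(f"API request failed: {message}")
    choices = response_json.get("choices") or []
    if not choices:
        raise RuntimeError("API response contained no choices")
    return choices[0]["message"]["content"]
```

It would be called as `print(extract_message_text(response.json()))`.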

Passing multiple images

from openai import OpenAI


client = OpenAI(api_key='Enter your Key')
response = client.chat.completions.create(
 model="gpt-4o",
 messages=[
   {
     "role": "user",
     "content": [
       {
         "type": "text",
         "text": "Tell me the difference and similarities of these two images",
       },
       {
         "type": "image_url",
         "image_url": {
           "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Walking_tiger_female.jpg/1920px-Walking_tiger_female.jpg",
         },
       },
       {
         "type": "image_url",
         "image_url": {
           "url": "https://upload.wikimedia.org/wikipedia/commons/7/73/Lion_waiting_in_Namibia.jpg",
         },
       },
     ],
   }
 ],
 max_tokens=300,
)

In the above code, we pass in multiple images using their URLs. Below are the images that we are passing.

The prompt asks the model to compare the two images and describe their similarities and differences.

Output

print(response.choices[0].message.content)
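The request above writes out one `image_url` entry per image by hand. For a variable number of images, the `content` list can be built in a loop, as in the sketch below. `build_image_content` is a hypothetical helper name. The optional `detail` field (`"low"`, `"high"`, or `"auto"`) is part of the API's `image_url` format; defaulting it to `"auto"` here is an assumption.

```python
from typing import Dict, List


def build_image_content(prompt: str, image_urls: List[str],
                        detail: str = "auto") -> List[Dict]:
    """Build the 'content' list for one user message with several images.

    Illustrative helper: one text part followed by one image_url part
    per URL, all sharing the same detail setting.
    """
    if detail not in {"low", "high", "auto"}:
        raise ValueError("detail must be 'low', 'high', or 'auto'")
    if not image_urls:
        raise ValueError("at least one image URL is required")
    content: List[Dict] = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append(
            {"type": "image_url", "image_url": {"url": url, "detail": detail}}
        )
    return content
```

The result would be passed as `messages=[{"role": "user", "content": build_image_content(prompt, urls)}]`.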

Image Generation

One of the most intriguing capabilities associated with ChatGPT-4 Vision is producing visuals from textual descriptions. In the implementation below, this is done with the DALL-E models through the Images API. It creates new opportunities for design, content production, and creative applications.

Key Features

Text-to-image generation: Producing visuals from detailed written descriptions. This has applications in the entertainment, education, and advertising sectors.
Style transfer: Applying the style of one image to another. This helps create content for social media, graphic design, and digital art.
Image editing: Altering existing images in response to text instructions. It can improve tasks such as image manipulation, restoration, and photo editing.

Example Use Case

Designers in the fashion business can use ChatGPT-4 Vision to create visuals of garment designs from written descriptions. This can speed up the design process, enable virtual prototyping, and improve idea exchange.

Implementation of Image Generation

The Images API provides three methods for interacting with images:

Creating images from scratch based on a text prompt (DALL-E 3 and DALL-E 2)
Creating edited versions of an existing image based on a new text prompt (DALL-E 2 only)
Creating variations of an existing image (DALL-E 2 only)
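The examples that follow read each result from a URL and then download it. The generation endpoint also accepts `response_format="b64_json"`, in which case the image arrives as base64 text in `data[0].b64_json` and no second download is needed. The sketch below shows that variant; `save_b64_image` and `generate_to_file` are hypothetical helper names, and `client` is assumed to be an `openai.OpenAI` instance created elsewhere.

```python
import base64
from pathlib import Path


def save_b64_image(b64_data: str, output_path: str) -> int:
    """Decode a base64 image payload, write it to disk, return bytes written."""
    image_bytes = base64.b64decode(b64_data)
    Path(output_path).write_bytes(image_bytes)
    return len(image_bytes)


def generate_to_file(client, prompt: str, output_path: str) -> int:
    """Generate one image and save it locally.

    Assumption: `client` is an openai.OpenAI instance. The call mirrors
    the article's generate() arguments, with response_format added.
    """
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        response_format="b64_json",
        n=1,
    )
    return save_b64_image(response.data[0].b64_json, output_path)
```

For example, `generate_to_file(client, "a white siamese cat", "cat.png")` would write the generated image to `cat.png`.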

Creating Images using prompt

from openai import OpenAI
client = OpenAI(api_key='Enter your key')


response = client.images.generate(
 model="dall-e-3",
 prompt="a white siamese cat",
 size="1024x1024",
 quality="standard",
 n=1,
)


image_url = response.data[0].url

We have prompted the DALL-E 3 model to create an image of a white Siamese cat.

# Download the image
image_response = requests.get(image_url)

# Open the image using PIL
image = Image.open(BytesIO(image_response.content))

# Display the image
display(image)

Output

Image variation of an existing image

from openai import OpenAI
client = OpenAI(api_key='Enter your key')


response = client.images.create_variation(
 model="dall-e-2",
 image=open("/content/spider_man.png", "rb"),
 n=1,
 size="1024x1024"
)


image_url = response.data[0].url

We are using DALL-E 2 to create a variation of an existing image. The image below is the one we pass to the API.

# Download the image
image_response = requests.get(image_url)

# Open the image using PIL
image = Image.open(BytesIO(image_response.content))

# Display the image
display(image)

Output

We can see that the model has created a variation of our image.
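According to OpenAI's documentation, the variations endpoint expects a square PNG file smaller than 4 MB. A quick local check before uploading can save a failed request. The sketch below reads only the PNG header with the standard library; `check_variation_input` is a hypothetical helper name, and the limits are written here as assumptions taken from that documentation.

```python
import os
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MAX_BYTES = 4 * 1024 * 1024  # assumed 4 MB limit for variation inputs


def check_variation_input(image_path: str) -> tuple:
    """Return (width, height) if the file looks like a square PNG under 4 MB.

    Only the first 24 bytes are inspected: the PNG signature followed by
    the IHDR chunk, which stores width and height as big-endian integers.
    """
    if os.path.getsize(image_path) >= MAX_BYTES:
        raise ValueError("image must be smaller than 4 MB")
    with open(image_path, "rb") as image_file:
        header = image_file.read(24)
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("image must be a PNG file")
    width, height = struct.unpack(">II", header[16:24])
    if width != height:
        raise ValueError(f"image must be square, got {width}x{height}")
    return width, height
```

It could be called as `check_variation_input("/content/spider_man.png")` before the `create_variation` request.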

Video Analysis

Video analysis extends image analysis into the temporal domain, extracting actionable insights from video streams. ChatGPT-4 Vision can support action recognition, motion detection, and event detection in videos. In the implementation below, this is done by sampling frames from the video and analyzing each one as an image.

Key Features

Action recognition: Recognizing particular movements made by the people in a video. This can be used in surveillance, human-computer interaction, and sports analytics.
Motion detection: Detecting movement between the frames of a video. This can benefit animation, video surveillance, and traffic monitoring applications.
Event detection: Locating important occurrences in a video. It can be applied in various fields, including security for incident detection, entertainment for automated highlight generation, and healthcare for patient activity monitoring.
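All three features depend on which frames of the video are examined. The implementation later in this article keeps one frame out of every `frame_interval` frames, so a short sketch of how that interval relates to time may help. The helper names below are hypothetical, and the frame rate is assumed to be known, for example from the video's metadata.

```python
def frame_interval_for(fps: float, seconds_between_samples: float) -> int:
    """Convert a sampling period in seconds into a frame interval (at least 1)."""
    return max(1, round(fps * seconds_between_samples))


def sampled_frame_indices(total_frames: int, frame_interval: int) -> list:
    """List the frame indices kept when sampling every `frame_interval` frames."""
    if frame_interval < 1:
        raise ValueError("frame_interval must be at least 1")
    return list(range(0, total_frames, frame_interval))


# A 30 fps clip sampled once per second keeps every 30th frame.
interval = frame_interval_for(fps=30, seconds_between_samples=1)
print(sampled_frame_indices(total_frames=95, frame_interval=interval))
```

A smaller interval catches short actions more reliably but increases the number of API calls, so the choice is a trade-off between coverage and cost.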

Example Use case

In sports analytics, ChatGPT-4 Vision can analyze game footage to identify player actions, such as dribbling, shooting, and passing in basketball. This data can provide insights into player performance, game strategy, and training efficacy.

Implementation of Video Analysis

import cv2
import base64
import requests


def encode_image(image):
   _, buffer = cv2.imencode('.jpg', image)
   return base64.b64encode(buffer).decode('utf-8')


def extract_frames(video_path, frame_interval=30):
   cap = cv2.VideoCapture(video_path)
   frames = []
   frame_count = 0


   while cap.isOpened():
       ret, frame = cap.read()
       if not ret:
           break
       if frame_count % frame_interval == 0:
           frames.append(frame)
       frame_count += 1


   cap.release()
   return frames


def analyze_frame(frame, api_key):
   base64_image = encode_image(frame)
   headers = {
       "Content-Type": "application/json",
       "Authorization": f"Bearer {api_key}"
   }


   payload = {
       "model": "gpt-4o",
       "messages": [
           {
               "role": "user",
               "content": [
                   {
                       "type": "text",
                       "text": "Describe me this image"
                   },
                   {
                       "type": "image_url",
                       "image_url": {
                           "url": f"data:image/jpeg;base64,{base64_image}"
                       }
                   }
               ]
           }
       ],
       "max_tokens": 300
   }


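   response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload, timeout=60)
   # Raise on HTTP errors here rather than failing later on a missing 'choices' key
   response.raise_for_status()
   return response.json()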


def analyze_video(video_path, api_key, frame_interval=30):
   frames = extract_frames(video_path, frame_interval)
   analysis_results = []


   for frame in frames:
       result = analyze_frame(frame, api_key)
       analysis_results.append(result)


   return analysis_results


# Path to your video
video_path = "/content/Kendall_Jenner.mp4"
api_key = "Enter your key"


# Analyze the video
results = analyze_video(video_path, api_key)


for result in results:
   print(result['choices'][0]["message"]["content"])

In the above code, we take a video of a celebrity doing a ramp walk, keep one frame out of every 30, and make an API call for each kept frame to get its description.
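The loop above makes one API request per sampled frame, so each description is produced without knowledge of the other frames. An alternative, sketched here, is to send several frames in a single request so that the model can describe the sequence as a whole. `build_frames_payload` is a hypothetical helper name; the cap of 10 frames and the `"low"` detail setting are assumptions chosen to keep the request small.

```python
def build_frames_payload(base64_frames: list, prompt: str, max_frames: int = 10,
                         model: str = "gpt-4o", max_tokens: int = 300) -> dict:
    """Build one chat completions payload that carries several video frames.

    Illustrative helper: frames are base64-encoded JPEG strings, thinned
    evenly so that no more than `max_frames` are included.
    """
    if not base64_frames:
        raise ValueError("at least one frame is required")
    if max_frames < 1:
        raise ValueError("max_frames must be at least 1")
    # Ceiling division gives the smallest step that respects max_frames.
    step = max(1, -(-len(base64_frames) // max_frames))
    selected = base64_frames[::step]
    content = [{"type": "text", "text": prompt}]
    for frame in selected:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{frame}", "detail": "low"},
        })
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
```

The frames returned by `extract_frames` would first be passed through `encode_image`, and the resulting payload posted once with `requests.post` in the same way as in `analyze_frame`.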

Output

Practical Applications of GPT-4 Vision

Here are the applications of GPT-4 Vision:

Medical Care

In the medical field, GPT-4 Vision can analyze medical images, such as MRIs and X-rays, to support the diagnosis of diseases. It can help medical practitioners make well-informed decisions by highlighting areas of concern and offering a second opinion.

For instance

Medical Imaging Analysis: Identifying anomalies in X-rays, such as tumors or fractures, and giving radiologists detailed descriptions of these findings.

E-commerce and retail

GPT-4 Vision improves the shopping experience for both retail and online customers by offering thorough product descriptions and visual search features. Customers can upload photographs to locate related items or recommendations based on their visual preferences.

For instance

Visual Search: Enabling customers to upload photographs in order to search for products, such as finding a dress that resembles one worn by a celebrity.

Automated Product Descriptions: Generating detailed product descriptions based on images, improving catalog management and user experience.

Conclusion

GPT-4 Vision is a significant advancement in artificial intelligence that combines natural language comprehension with visual analysis. It has applications in various sectors, including healthcare, retail, security, and education, where it offers creative solutions and improves user experiences. Built on transformer architectures and multimodal learning, GPT-4 Vision creates new avenues for engaging with and comprehending the visual world.

Frequently Asked Questions

Q1. What is GPT-4 Vision?

Ans. GPT-4 Vision is an advanced AI model that integrates natural language processing with image and video analysis capabilities, allowing for detailed interpretation and generation of visual content.

Q2. What are the primary applications of GPT-4 Vision?

Ans. Key applications include healthcare (medical imaging analysis), retail (visual search and product descriptions), security (video surveillance and intrusion detection), and education (interactive learning and assignment evaluation).

Q3. How does GPT-4 Vision perform image analysis?

Ans. GPT-4 Vision identifies objects, scenes, and activities within images and generates detailed natural language descriptions of the visual content.

Q4. Can GPT-4 Vision analyze videos?

Ans. Yes, GPT-4 Vision can analyze sequences of frames in videos to identify actions, events, and changes over time, enhancing applications in security, entertainment, and more.

Q5. Is GPT-4 Vision capable of generating images?

Ans. Yes. Images can be generated from textual descriptions, which in this article is done with the DALL-E models through the Images API. This is useful in creative design and prototyping applications.

Badrinarayan M

Data science Trainee at Analytics Vidhya, specializing in ML, DL and Gen AI. Dedicated to sharing insights through articles on these subjects. Eager to learn and contribute to the field's advancements. Passionate about leveraging data to solve complex problems and drive innovation.
