Since the release of OpenAI's GPT models, the landscape of Natural Language Processing has changed entirely and moved toward a new notion called Generative AI. Large Language Models are at the core of it: they can understand complex human queries and generate relevant answers to them. The next step for these LLMs is multimodality, that is, the ability to understand data other than text, including images, audio, and video. Several multimodal models have recently been released, both open source and closed source, such as Gemini from Google, LLaVA, and GPT-4V. Recently, OpenAI announced a newer multimodal model called GPT-4o (Omni). In this article, we will create a multimodal chatbot with OpenAI's GPT-4o.
The recently announced OpenAI GPT-4o marks a big leap in AI for its speed, accuracy, and ability to understand and generate text, audio, and images. This multimodal model can translate languages, write creative content, analyze or generate speech with varying tones, and even describe real-world scenes or create images based on your descriptions. Beyond its impressive capabilities, GPT-4o integrates seamlessly with ChatGPT, allowing real-time conversations where it can identify visual information and ask relevant questions. This ability to interact across modalities paves the way for a more natural and intuitive way of interacting with computers, potentially assisting visually impaired users and creating new artistic mediums. GPT-4o stands out as a groundbreaking next-gen model that pushes the boundaries of AI.
In this section, we will begin writing the code for the multimodal chatbot using GPT-4o. The first step is to install the necessary libraries for this code to work. To do this, we run the command below:
pip install openai chainlit
Running this installs the OpenAI and Chainlit libraries. The OpenAI library lets us work with the different OpenAI models, including text generation models like GPT-4o and GPT-3.5, image generation models like DALL-E 3, and speech-to-text models like Whisper.
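Note that the OpenAI client picks up the API key from the OPENAI_API_KEY environment variable. A minimal sketch of setting it from within Python is shown below (the key shown is only a placeholder, not a real value):

import os

# Placeholder key for illustration; replace it with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-..."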
We install Chainlit for creating the UI. The Chainlit library lets us create quick chatbots entirely in Python, without writing JavaScript, HTML, or CSS. Before beginning with the chatbot, we need to create some helper functions. The first one handles images. We cannot provide images to the model directly; we need to encode them to base64 first. For this, we write:
import base64

def image2base64(image_path):
    # Read the image in binary mode and encode it as a base64 string
    with open(image_path, "rb") as img:
        encoded_string = base64.b64encode(img.read())
    return encoded_string.decode("utf-8")
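As a quick check, here is a minimal sketch of how this helper can be used to build the data URL that the chat API expects; the file name sample.png is just an assumed example:

# Hypothetical usage: encode a local image and wrap it in a data URL
encoded = image2base64("sample.png")
image_url = f"data:image/png;base64,{encoded}"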
The multimodal chatbot should also accept audio inputs, so we need to process audio before sending it to the model. For this, we use the speech-to-text model from OpenAI, Whisper.
from openai import OpenAI

client = OpenAI()

def audio_process(audio_path):
    # Transcribe the audio file to text with OpenAI's Whisper model
    with open(audio_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )
    return transcription.text
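A minimal usage sketch, assuming an audio file named speech.mp3 is available locally (the same kind of file we test with later):

# Hypothetical usage: transcribe a local audio file to text
transcript = audio_process("speech.mp3")
print(transcript)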
We cannot predict what type of message the user will send. Sometimes it will be plain text, sometimes it will include an image, and sometimes it will include an audio file. Based on that, we need to alter the message we send to the OpenAI model. For this, we write another function that assembles the different user inputs for the model:
def append_messages(image_url=None, query=None, audio_transcript=None):
    # Build the content list for a single user message based on what is available
    message_list = []

    if image_url:
        message_list.append({"type": "image_url", "image_url": {"url": image_url}})
    if query and not audio_transcript:
        message_list.append({"type": "text", "text": query})
    if audio_transcript:
        message_list.append({"type": "text", "text": query + "\n" + audio_transcript})

    # Send the assembled message to GPT-4o and return the first completion choice
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message_list}],
        max_tokens=1024,
    )

    return response.choices[0]
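For a text-only query, a minimal sketch of calling this helper and reading the reply looks like this (the question string is just an example):

# Hypothetical usage: ask a plain-text question and print the model's reply
choice = append_messages(query="What is machine learning?")
print(choice.message.content)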
With this, we are done creating the helper functions. We will call them later to pass user queries, audio, and image data to the model and get the responses back.
Now, we will build the UI part of the chatbot. This can be done very easily with the Chainlit library. The code we write goes in the same file where our helper functions are defined.
import chainlit as cl

@cl.on_message
async def chat(msg: cl.Message):
    # Separate any attached image and audio files from the incoming message
    images = [file for file in msg.elements if "image" in file.mime]
    audios = [file for file in msg.elements if "audio" in file.mime]

    # If an image is attached, encode it as a base64 data URL
    if len(images) > 0:
        base64_image = image2base64(images[0].path)
        image_url = f"data:image/png;base64,{base64_image}"
    # Otherwise, if an audio file is attached, transcribe it to text
    elif len(audios) > 0:
        text = audio_process(audios[0].path)

    response_msg = cl.Message(content="")

    # Call the model with whichever inputs the user actually provided
    if len(images) == 0 and len(audios) == 0:
        response = append_messages(query=msg.content)
    elif len(audios) == 0:
        response = append_messages(image_url=image_url, query=msg.content)
    else:
        response = append_messages(query=msg.content, audio_transcript=text)

    # Fill in the reply text and send it back to the Chainlit UI
    response_msg.content = response.message.content
    await response_msg.send()
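Optionally, Chainlit also lets us greet the user when a new chat session starts. This is not required for the chatbot to work; a minimal sketch, added to the same file, could look like this:

@cl.on_chat_start
async def start():
    # Send a simple welcome message when a new chat session begins
    await cl.Message(content="Hi! Send me text, an image, or an audio file.").send()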
Putting everything together, the complete code in a single file looks like this:

from openai import OpenAI
import base64
import chainlit as cl

client = OpenAI()


def append_messages(image_url=None, query=None, audio_transcript=None):
    # Build the content list for a single user message based on what is available
    message_list = []

    if image_url:
        message_list.append({"type": "image_url", "image_url": {"url": image_url}})
    if query and not audio_transcript:
        message_list.append({"type": "text", "text": query})
    if audio_transcript:
        message_list.append({"type": "text", "text": query + "\n" + audio_transcript})

    # Send the assembled message to GPT-4o and return the first completion choice
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message_list}],
        max_tokens=1024,
    )

    return response.choices[0]


def image2base64(image_path):
    # Read the image in binary mode and encode it as a base64 string
    with open(image_path, "rb") as img:
        encoded_string = base64.b64encode(img.read())
    return encoded_string.decode("utf-8")


def audio_process(audio_path):
    # Transcribe the audio file to text with OpenAI's Whisper model
    with open(audio_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )
    return transcription.text


@cl.on_message
async def chat(msg: cl.Message):
    # Separate any attached image and audio files from the incoming message
    images = [file for file in msg.elements if "image" in file.mime]
    audios = [file for file in msg.elements if "audio" in file.mime]

    # If an image is attached, encode it as a base64 data URL
    if len(images) > 0:
        base64_image = image2base64(images[0].path)
        image_url = f"data:image/png;base64,{base64_image}"
    # Otherwise, if an audio file is attached, transcribe it to text
    elif len(audios) > 0:
        text = audio_process(audios[0].path)

    response_msg = cl.Message(content="")

    # Call the model with whichever inputs the user actually provided
    if len(images) == 0 and len(audios) == 0:
        response = append_messages(query=msg.content)
    elif len(audios) == 0:
        response = append_messages(image_url=image_url, query=msg.content)
    else:
        response = append_messages(query=msg.content, audio_transcript=text)

    # Fill in the reply text and send it back to the Chainlit UI
    response_msg.content = response.message.content
    await response_msg.send()
To execute this, type chainlit run app.py in the terminal, assuming that the code resides in a file named app.py. After executing this command, the app becomes active on localhost:8000, and we will see the Chainlit chat interface shown below.
Now let us type in just a normal text query and see the output generated.
We see that GPT-4o successfully generated an output for the user query. We also observe that the code in the response is highlighted, and we can quickly copy and paste it. This is all managed by Chainlit, which handles the underlying HTML, CSS, and JavaScript. Next, let us try giving an image and asking the model about it.
Here, the model has responded well to the image we uploaded. It identified the image as an emoji and provided information about its identity and usage. Now, let us pass an audio file and test it.
The speech.mp3 audio contains information about Machine Learning, so we asked the model to summarize its contents. The model generated a summary that is relevant to the content present in the audio file.
In conclusion, developing a multimodal chatbot with OpenAI's GPT-4o (Omni) marks a great step forward in AI technology, ushering in a new era of interactive experiences. Here, we've explored seamlessly integrating text, image, and audio inputs into conversations with the chatbot, leveraging the capabilities of GPT-4o. This approach enhances user engagement and opens doors to different practical applications, from aiding visually impaired users to creating new artistic mediums. By combining the power of language understanding with multimodal capabilities, GPT-4o shows its potential to revolutionize how we interact with AI systems.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.
Q. What is GPT-4o?
A. GPT-4o is a groundbreaking AI model developed by OpenAI that can understand and generate text, audio, and images.
Q. How is GPT-4o different from earlier models?
A. GPT-4o sets itself apart by integrating text, audio, and image understanding and generation in a single model.
Q. What can GPT-4o do?
A. GPT-4o can translate languages, create different forms of creative content, analyze or generate speech with different tones, and describe real-world scenes or create images based on descriptions.
Q. How does the multimodal chatbot work?
A. The multimodal chatbot uses GPT-4o's capabilities to understand and respond to user queries, whether they contain text, images, or audio inputs.
Q. Which libraries are needed to build the chatbot?
A. The necessary libraries for building the chatbot include OpenAI for accessing GPT-4o, Chainlit for creating the UI, and base64 for encoding images.
Q. What role does Chainlit play?
A. Chainlit simplifies the development of chatbot interfaces by managing the underlying HTML, CSS, and JavaScript, making UI creation quick and efficient.