OpenAI has recently unveiled a suite of next-generation audio models, enhancing the capabilities of voice-enabled applications. These advancements include new speech-to-text (STT) and text-to-speech (TTS) models, offering developers more tools to create sophisticated voice agents. Released via the API, these models make it much easier for developers worldwide to build flexible and reliable voice agents. In this article, we will explore the features and applications of OpenAI’s latest GPT-4o-Transcribe, GPT-4o-Mini-Transcribe, and GPT-4o-mini TTS models. We’ll also learn how to access OpenAI’s audio models and try them out ourselves. So let’s get started!
OpenAI has introduced a new generation of audio models designed to enhance speech recognition and voice synthesis capabilities. These models offer improvements in accuracy, speed, and flexibility, enabling developers to build more powerful AI-driven voice applications. The suite includes two speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, and one text-to-speech model, gpt-4o-mini-tts.
The speech-to-text models come with advanced features such as noise cancellation. They are also equipped with a semantic voice activity detector that can accurately detect when the user has finished speaking. These innovations help developers handle many of the common issues that come up while building voice agents. Along with the new models, OpenAI also announced that its recently launched Agents SDK now supports audio, which makes it even easier for developers to build voice agents.
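To give a sense of how the speech-to-text side is used in practice, here is a minimal sketch that sends a local audio file to the new gpt-4o-transcribe model through the audio transcriptions endpoint (the file name and the API key placeholder are assumptions for illustration):

from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Transcribe a local audio file with the new gpt-4o-transcribe model
with open("sample.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)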
Learn More: How to Use OpenAI Responses API & Agent SDK?
The advancements in these audio models are attributed to several key technical innovations.
The latest text-to-speech model, gpt-4o-mini-tts, is available on OpenAI.fm, a new platform released by OpenAI. Here’s how you can access this model:
First, head to www.openai.fm.
On the interface that opens up, choose your voice and set the vibe. If you can’t find the right character with the right vibe, click on the refresh button to get different options.
You can further customize the chosen voice with a detailed prompt. Below the vibe options, you can type in details like accent, tone, pacing, etc. to get the exact voice you want.
Once set, just type your script into the text input box on the right, and click on the ‘PLAY’ button. If you like what you hear, you can either download the audio or share it externally. If not, you can keep trying out more iterations till you get it right.
The page requires no signup, and you can experiment with the model as much as you like. Moreover, in the top right corner, there’s a toggle that gives you the API code for the model, tailored to your choices.
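For reference, the snippet behind that toggle is built on OpenAI’s speech endpoint. Here’s a rough sketch of what it looks like for gpt-4o-mini-tts, where the vibe prompt goes into the instructions field (the voice, script, and instructions below are placeholders; your generated code will reflect your own selections):

from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Generate speech with gpt-4o-mini-tts; the vibe prompt is passed as `instructions`
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Thank you for calling. How can I help you today?",
    instructions="Speak in a calm, reassuring tone with steady pacing.",
) as response:
    # Stream the synthesized audio straight to a file
    response.stream_to_file("speech.mp3")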
Now that we know how to use the model, let’s give it a try! First, let’s try it out on the OpenAI.fm website.
Suppose I wish to build an “Emergency Services” Voice support agent.
For this agent, I select the:
Tone: Calm, confident, and authoritative. Reassuring to keep the caller at ease while handling the situation. Professional yet empathetic, reflecting genuine concern for the caller’s well-being.
Pacing: Steady, clear, and deliberate. Not too fast to avoid panic but not too slow to delay response. Slight pauses to give the caller time to respond and process information.
Clarity: Clear, neutral accent with a well-enunciated voice. Avoid jargon or complicated terms, using simple, easy-to-understand language.
Empathy: Acknowledge the caller’s emotional state (fear, panic, etc.) without adding to it.
Offer calm reassurance and support throughout the conversation.
“Hello, this is Emergency Services. I’m here to help you. Please stay calm and listen carefully as I guide you through this situation.”
“Help is on the way, but I need a bit of information to make sure we respond quickly and appropriately.”
“Please provide me with your location. The exact address or nearby landmarks will help us get to you faster.”
“Thank you; if anyone is injured, I need you to stay with them and avoid moving them unless necessary.”
“If there’s any bleeding, apply pressure to the wound to control it. If the person is not breathing, I’ll guide you through CPR. Please stay with them and keep calm.”
“If there are no injuries, please find a safe place and stay there. Avoid danger, and wait for emergency responders to arrive.”
“You’re doing great. Stay on the line with me, and I will ensure help is on the way and keep you updated until responders arrive.”
Output:
Wasn’t that great? OpenAI’s latest audio models are also accessible through the API, enabling developers to integrate them into various applications.
Now let’s test that out.
We’ll be accessing the gpt-4o-audio-preview model via OpenAI’s API and trying out two tasks: one for text-to-speech, and the other for speech-to-text.
For this task, I’ll be asking the model to tell me a joke.
Code Input:
import base64
from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Ask the model to answer in both text and spoken audio
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Can you tell me a joke about an AI trying to tell a joke?"
        }
    ]
)

print(completion.choices[0])

# Decode the base64-encoded audio in the response and save it as a WAV file
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("output.wav", "wb") as f:
    f.write(wav_bytes)
Response:
For our second task, let’s give the model this audio file and see if it can tell us about the recording.
Code Input:
import base64
import requests
from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Fetch the audio file and convert it to a base64 encoded string
url = "https://cdn.openai.com/API/docs/audio/alloy.wav"
response = requests.get(url)
response.raise_for_status()
wav_data = response.content
encoded_string = base64.b64encode(wav_data).decode('utf-8')

# Send the audio along with a text question asking what the recording contains
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)

print(completion.choices[0].message)
Response:
To assess the performance of its latest speech-to-text models, OpenAI conducted benchmark tests using Word Error Rate (WER), a standard metric in speech recognition. WER measures transcription accuracy by calculating the percentage of incorrect words compared to a reference transcript. A lower WER indicates better performance with fewer errors.
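To make the metric concrete, here is a small sketch of how WER can be computed with a word-level edit distance (illustrative only, not OpenAI’s evaluation code):

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One missing word out of five reference words -> WER of 0.2
print(word_error_rate("help is on the way", "help is on way"))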
As the results show, the new speech-to-text models – gpt-4o-transcribe and gpt-4o-mini-transcribe – deliver lower word error rates and enhanced language recognition compared to previous models like Whisper.
One of the key benchmarks used is FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech), which is a multilingual speech dataset covering over 100 languages with manually transcribed audio samples.
The results indicate that OpenAI’s new models achieve lower word error rates than Whisper across most of the languages covered by the benchmark.
Here’s how much OpenAI’s GPT-4o-Transcribe, GPT-4o-Mini-Transcribe, and GPT-4o-mini TTS models cost per million tokens:
Text Tokens
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Estimated Cost |
|---|---|---|---|
| gpt-4o-mini-tts | $0.60 | – | $0.015/min |
| gpt-4o-transcribe | $2.50 | $10.00 | $0.006/min |
| gpt-4o-mini-transcribe | $1.25 | $5.00 | $0.003/min |
Audio Tokens
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Estimated Cost |
|---|---|---|---|
| gpt-4o-mini-tts | – | $12.00 | $0.015/min |
| gpt-4o-transcribe | $6.00 | – | $0.006/min |
| gpt-4o-mini-transcribe | $3.00 | – | $0.003/min |
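As a quick sanity check on how the per-token prices translate into a bill, here is a tiny sketch that estimates the cost of a single call from token counts, using the numbers in the tables above (the token counts in the example are made-up placeholders):

# Prices in USD per 1M tokens, taken from the tables above
PRICES = {
    "gpt-4o-transcribe":      {"audio_in": 6.00, "text_in": 2.50, "text_out": 10.00},
    "gpt-4o-mini-transcribe": {"audio_in": 3.00, "text_in": 1.25, "text_out": 5.00},
    "gpt-4o-mini-tts":        {"text_in": 0.60, "audio_out": 12.00},
}

def estimate_cost(model: str, **token_counts: int) -> float:
    """Sum the cost across token types for one request."""
    return sum(PRICES[model][kind] * count / 1_000_000 for kind, count in token_counts.items())

# Hypothetical call: 20,000 audio input tokens transcribed into 1,000 text output tokens
print(f"${estimate_cost('gpt-4o-transcribe', audio_in=20_000, text_out=1_000):.4f}")  # $0.1300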
OpenAI’s latest audio models mark a significant shift from purely text-based agents to sophisticated voice agents, bridging the gap between AI and human-like interaction. These models don’t just understand what to say—they grasp how to say it, capturing tone, pacing, and emotion with remarkable precision. By offering both speech-to-text and text-to-speech capabilities, OpenAI enables developers to create AI-driven voice experiences that feel more natural and engaging.
The availability of these models via API means developers now have greater control over both the content and delivery of AI-generated speech. Additionally, OpenAI’s Agents SDK makes it easier to transform traditional text-based agents into fully functional voice agents, opening up new possibilities for customer service, accessibility tools, and real-time communication applications. As OpenAI continues to refine its voice technology, these advancements set a new standard for AI-powered interactions.
Q. What are the new audio models that OpenAI has launched?
A. OpenAI has introduced three new audio models—GPT-4o-Transcribe, GPT-4o-Mini-Transcribe, and GPT-4o-mini TTS. These models are designed to enhance speech-to-text and text-to-speech capabilities, enabling more accurate transcriptions and natural-sounding AI-generated speech.
Q. How do the new audio models compare to OpenAI’s Whisper models?
A. Compared to OpenAI’s Whisper models, the new GPT-4o audio models offer improved transcription accuracy and lower word error rates. They also offer enhanced multilingual support and better real-time responsiveness. Additionally, the text-to-speech model provides more natural voice modulation, allowing users to adjust tone, style, and pacing for more lifelike AI-generated speech.
Q. What can the new GPT-4o-mini TTS model do?
A. The new TTS model allows users to generate speech with customizable styles, tones, and pacing. It enhances human-like voice modulation and supports diverse use cases, from AI voice assistants to audiobook narration. The model also provides better emotional expression and clarity than previous iterations.
Q. What is the difference between GPT-4o-Transcribe and GPT-4o-Mini-Transcribe?
A. GPT-4o-Transcribe offers industry-leading transcription accuracy, making it ideal for professional use cases like meeting transcriptions and customer service logs. GPT-4o-Mini-Transcribe is optimized for efficiency and speed, catering to real-time applications such as live captions and interactive AI agents.
Q. What is OpenAI.fm?
A. OpenAI.fm is a web platform where users can test OpenAI’s text-to-speech model without signing up. Users can select a voice, adjust the tone, enter a script, and generate audio instantly. The platform also provides the underlying API code for further customization.
Q. Can I build voice agents with OpenAI’s Agents SDK?
A. Yes, OpenAI’s Agents SDK now supports audio, allowing developers to convert text-based agents into interactive voice agents. This makes it easier to create AI-powered customer support bots, accessibility tools, and personalized AI assistants with advanced voice capabilities.