OpenAI has recently unveiled a suite of next-generation audio models, enhancing the capabilities of voice-enabled applications. These advancements include new speech-to-text (STT) and text-to-speech (TTS) models, offering developers more tools to create sophisticated voice agents. Released via the API, these models make it much easier for developers worldwide to build flexible and reliable voice agents. In this article, we will explore the features and applications of OpenAI’s latest GPT-4o-Transcribe, GPT-4o-Mini-Transcribe, and GPT-4o-mini TTS models. We’ll also learn how to access OpenAI’s audio models and try them out ourselves. So let’s get started!
OpenAI has introduced a new generation of audio models designed to enhance speech recognition and voice synthesis capabilities. These models offer improvements in accuracy, speed, and flexibility, enabling developers to build more powerful AI-driven voice applications. The suite includes two speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, and one text-to-speech model, gpt-4o-mini-tts.
The speech-to-text models come with advanced features such as noise cancellation. They are also equipped with a semantic voice activity detector that can accurately detect when the user has finished speaking. These innovations help developers handle many of the common issues that come up while building voice agents. Along with the new models, OpenAI also announced that its recently launched Agents SDK now supports audio, which makes it even easier for developers to build voice agents.
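To give a sense of how the speech-to-text side is used in practice, here is a minimal sketch that sends a local audio file to the new gpt-4o-transcribe model through the audio transcriptions endpoint (the file name and the API key placeholder are assumptions for illustration):

from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Transcribe a local audio file with the new gpt-4o-transcribe model
with open("sample.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)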
Learn More: How to Use OpenAI Responses API & Agent SDK?
The advancements in these audio models are attributed to several key technical innovations.
The latest text-to-speech model, gpt-4o-mini-tts, is available on OpenAI.fm, a new platform released by OpenAI. Here’s how you can access this model:
First, head to www.openai.fm.
On the interface that opens up, choose your voice and set the vibe. If you can’t find the right character with the right vibe, click on the refresh button to get different options.
You can further customize the chosen voice with a detailed prompt. Below the vibe options, you can type in details like accent, tone, pacing, etc. to get the exact voice you want.
Once set, just type your script into the text input box on the right, and click on the ‘PLAY’ button. If you like what you hear, you can either download the audio or share it externally. If not, you can keep trying out more iterations till you get it right.
The page requires no signup, and you can experiment with the model as much as you like. Moreover, in the top right corner, there’s a toggle that gives you the API code for the model, tailored to your choices.
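For reference, the snippet behind that toggle is built on OpenAI’s speech endpoint. Here’s a rough sketch of what it looks like for gpt-4o-mini-tts, where the vibe prompt goes into the instructions field (the voice, script, and instructions below are placeholders; your generated code will reflect your own selections):

from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Generate speech with gpt-4o-mini-tts; the vibe prompt is passed as `instructions`
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Thank you for calling. How can I help you today?",
    instructions="Speak in a calm, reassuring tone with steady pacing.",
) as response:
    # Stream the synthesized audio straight to a file
    response.stream_to_file("speech.mp3")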
Now that we know how to use the model, let’s give it a try! First, let’s try it out on the OpenAI.fm website.
Suppose I wish to build an “Emergency Services” Voice support agent.
For this agent, I select the:
Tone: Calm, confident, and authoritative. Reassuring to keep the caller at ease while handling the situation. Professional yet empathetic, reflecting genuine concern for the caller’s well-being.
Pacing: Steady, clear, and deliberate. Not too fast to avoid panic but not too slow to delay response. Slight pauses to give the caller time to respond and process information.
Clarity: Clear, neutral accent with a well-enunciated voice. Avoid jargon or complicated terms, using simple, easy-to-understand language.
Empathy: Acknowledge the caller’s emotional state (fear, panic, etc.) without adding to it.
Offer calm reassurance and support throughout the conversation.
“Hello, this is Emergency Services. I’m here to help you. Please stay calm and listen carefully as I guide you through this situation.”
“Help is on the way, but I need a bit of information to make sure we respond quickly and appropriately.”
“Please provide me with your location. The exact address or nearby landmarks will help us get to you faster.”
“Thank you; if anyone is injured, I need you to stay with them and avoid moving them unless necessary.”
“If there’s any bleeding, apply pressure to the wound to control it. If the person is not breathing, I’ll guide you through CPR. Please stay with them and keep calm.”
“If there are no injuries, please find a safe place and stay there. Avoid danger, and wait for emergency responders to arrive.”
“You’re doing great. Stay on the line with me, and I will ensure help is on the way and keep you updated until responders arrive.”
Output:
Wasn’t that great? OpenAI’s latest audio models are also accessible through the API, enabling developers to integrate them into various applications.
Now let’s test that out.
We’ll be accessing the gpt-4o-audio-preview model via OpenAI’s API and trying out two tasks: one for text-to-speech, and the other for speech-to-text.
For this task, I’ll be asking the model to tell me a joke.
Code Input:
import base64
from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Ask the model to answer in both text and spoken audio
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Can you tell me a joke about an AI trying to tell a joke?"
        }
    ]
)

print(completion.choices[0])

# Decode the base64-encoded audio in the response and save it as a WAV file
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("output.wav", "wb") as f:
    f.write(wav_bytes)
Response:
For our second task, let’s give the model this audio file and see if it can tell us about the recording.
Code Input:
import base64
import requests
from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Fetch the audio file and convert it to a base64 encoded string
url = "https://cdn.openai.com/API/docs/audio/alloy.wav"
response = requests.get(url)
response.raise_for_status()
wav_data = response.content
encoded_string = base64.b64encode(wav_data).decode('utf-8')

# Send the audio along with a text question asking what the recording contains
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)

print(completion.choices[0].message)
Response:
To assess the performance of its latest speech-to-text models, OpenAI conducted benchmark tests using Word Error Rate (WER), a standard metric in speech recognition. WER measures transcription accuracy by calculating the percentage of incorrect words compared to a reference transcript. A lower WER indicates better performance with fewer errors.
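To make the metric concrete, here is a small sketch of how WER can be computed with a word-level edit distance (illustrative only, not OpenAI’s evaluation code):

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One missing word out of five reference words -> WER of 0.2
print(word_error_rate("help is on the way", "help is on way"))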
As the results show, the new speech-to-text models – gpt-4o-transcribe and gpt-4o-mini-transcribe – deliver lower word error rates and enhanced language recognition compared to previous models like Whisper.
One of the key benchmarks used is FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech), which is a multilingual speech dataset covering over 100 languages with manually transcribed audio samples.
The results indicate that OpenAI’s new models achieve lower word error rates than Whisper across most of the languages covered by the benchmark.
Here’s how much OpenAI’s GPT-4o-Transcribe, GPT-4o-Mini-Transcribe, and GPT-4o-mini TTS models cost per million tokens:
Text Tokens
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Estimated Cost |
|---|---|---|---|
| gpt-4o-mini-tts | $0.60 | – | $0.015/min |
| gpt-4o-transcribe | $2.50 | $10.00 | $0.006/min |
| gpt-4o-mini-transcribe | $1.25 | $5.00 | $0.003/min |
Audio Tokens
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Estimated Cost |
|---|---|---|---|
| gpt-4o-mini-tts | – | $12.00 | $0.015/min |
| gpt-4o-transcribe | $6.00 | – | $0.006/min |
| gpt-4o-mini-transcribe | $3.00 | – | $0.003/min |
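As a quick sanity check on how the per-token prices translate into a bill, here is a tiny sketch that estimates the cost of a single call from token counts, using the numbers in the tables above (the token counts in the example are made-up placeholders):

# Prices in USD per 1M tokens, taken from the tables above
PRICES = {
    "gpt-4o-transcribe":      {"audio_in": 6.00, "text_in": 2.50, "text_out": 10.00},
    "gpt-4o-mini-transcribe": {"audio_in": 3.00, "text_in": 1.25, "text_out": 5.00},
    "gpt-4o-mini-tts":        {"text_in": 0.60, "audio_out": 12.00},
}

def estimate_cost(model: str, **token_counts: int) -> float:
    """Sum the cost across token types for one request."""
    return sum(PRICES[model][kind] * count / 1_000_000 for kind, count in token_counts.items())

# Hypothetical call: 20,000 audio input tokens transcribed into 1,000 text output tokens
print(f"${estimate_cost('gpt-4o-transcribe', audio_in=20_000, text_out=1_000):.4f}")  # $0.1300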
OpenAI’s latest audio models mark a significant shift from purely text-based agents to sophisticated voice agents, bridging the gap between AI and human-like interaction. These models don’t just understand what to say—they grasp how to say it, capturing tone, pacing, and emotion with remarkable precision. By offering both speech-to-text and text-to-speech capabilities, OpenAI enables developers to create AI-driven voice experiences that feel more natural and engaging.
The availability of these models via API means developers now have greater control over both the content and delivery of AI-generated speech. Additionally, OpenAI’s Agents SDK makes it easier to transform traditional text-based agents into fully functional voice agents, opening up new possibilities for customer service, accessibility tools, and real-time communication applications. As OpenAI continues to refine its voice technology, these advancements set a new standard for AI-powered interactions.
Q. What are the new audio models that OpenAI has launched?
A. OpenAI has introduced three new audio models—GPT-4o-Transcribe, GPT-4o-Mini-Transcribe, and GPT-4o-mini TTS. These models are designed to enhance speech-to-text and text-to-speech capabilities, enabling more accurate transcriptions and natural-sounding AI-generated speech.
Q. How do the new audio models compare to OpenAI’s Whisper models?
A. Compared to OpenAI’s Whisper models, the new GPT-4o audio models offer improved transcription accuracy and lower word error rates. They also offer enhanced multilingual support and better real-time responsiveness. Additionally, the text-to-speech model provides more natural voice modulation, allowing users to adjust tone, style, and pacing for more lifelike AI-generated speech.
Q. What can the new GPT-4o-mini TTS model do?
A. The new TTS model allows users to generate speech with customizable styles, tones, and pacing. It enhances human-like voice modulation and supports diverse use cases, from AI voice assistants to audiobook narration. The model also provides better emotional expression and clarity than previous iterations.
Q. What is the difference between GPT-4o-Transcribe and GPT-4o-Mini-Transcribe?
A. GPT-4o-Transcribe offers industry-leading transcription accuracy, making it ideal for professional use cases like meeting transcriptions and customer service logs. GPT-4o-Mini-Transcribe is optimized for efficiency and speed, catering to real-time applications such as live captions and interactive AI agents.
Q. What is OpenAI.fm?
A. OpenAI.fm is a web platform where users can test OpenAI’s text-to-speech model without signing up. Users can select a voice, adjust the tone, enter a script, and generate audio instantly. The platform also provides the underlying API code for further customization.
Q. Can I build voice agents with OpenAI’s Agents SDK?
A. Yes, OpenAI’s Agents SDK now supports audio, allowing developers to convert text-based agents into interactive voice agents. This makes it easier to create AI-powered customer support bots, accessibility tools, and personalized AI assistants with advanced voice capabilities.