OpenAI’s Agent SDK has taken things up a notch with the release of its Voice Agent feature, enabling you to create intelligent, real-time, speech-driven applications. Whether you’re building a language tutor, a virtual assistant, or a support bot, this new capability brings in a whole new level of interaction—natural, dynamic, and human-like. Let’s break it down and walk through what it is, how it works, and how you can build a Multilingual Voice Agent yourself.
A Voice Agent is a system that listens to your voice, understands what you’re saying, thinks about a response, and then replies out loud. The magic is powered by a combination of speech-to-text, language models, and text-to-speech technologies.
The OpenAI Agent SDK makes this incredibly accessible through something called a VoicePipeline, a structured 3-step process:

1. Speech-to-text transcribes your audio into text.
2. Your code (typically an agentic workflow) produces a text response.
3. Text-to-speech converts that response back into audio.
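In code, the whole pipeline collapses into a few lines. Here’s a minimal sketch (the full, runnable version comes later in this post; `buffer` stands in for your recorded audio):

```python
from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

agent = Agent(name="Assistant", instructions="Be polite and concise.")

# Steps 1 and 3 (speech-to-text / text-to-speech) are handled by the pipeline;
# step 2 is your workflow, here a single-agent run.
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

async def talk(buffer):
    result = await pipeline.run(AudioInput(buffer=buffer))
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            ...  # play or save event.data
```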
Depending on your use case, you’ll want to pick one of two core architectures supported by OpenAI:
The first option is the speech-to-speech (multimodal) architecture: a real-time, all-audio approach using models like gpt-4o-realtime-preview. Instead of translating to text behind the scenes, the model processes and generates speech directly.
Perfect for:
| Strengths | Best For |
| --- | --- |
| Low latency | Interactive, unstructured dialogue |
| Multimodal understanding (voice, tone, pauses) | Real-time engagement |
| Emotion-aware replies | Customer support, virtual companions |
This approach makes conversations feel fluid and human, but since there is no text in the middle, it takes extra care in areas like logging and producing exact transcripts.
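The hands-on example later in this post uses the chained pipeline, but if you want a quick feel for the speech-to-speech route, the openai Python package exposes a Realtime connection you can experiment with. Here’s a minimal connectivity sketch (text modality only for brevity; the event names follow the Realtime beta API and may change):

```python
import asyncio

from openai import AsyncOpenAI


async def main() -> None:
    client = AsyncOpenAI()
    # One bidirectional connection carries both input and output.
    async with client.beta.realtime.connect(
        model="gpt-4o-realtime-preview"
    ) as connection:
        # Text-only for brevity; add "audio" to stream speech both ways.
        await connection.session.update(session={"modalities": ["text"]})
        await connection.conversation.item.create(
            item={
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Say hello!"}],
            }
        )
        await connection.response.create()
        async for event in connection:
            if event.type == "response.text.delta":
                print(event.delta, end="", flush=True)
            elif event.type == "response.done":
                break


asyncio.run(main())
```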
The chained method is more traditional: speech gets turned into text, the LLM processes that text, and then the reply is turned back into speech. The recommended models here are:

- gpt-4o-transcribe for speech-to-text
- gpt-4o for the language model
- gpt-4o-mini-tts for text-to-speech
Perfect for:
| Strengths | Best For |
| --- | --- |
| High control & transparency | Structured workflows |
| Reliable, text-based processing | Apps needing transcripts |
| Predictable outputs | Customer-facing scripted flows |
This is easier to debug and a great starting point if you’re new to voice agents.
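To see why this architecture is so transparent, here’s what the three hops look like if you wire them up yourself with the plain openai client (a sketch using the models recommended above; the file names and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# 1) Speech -> text
with open("question.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
    )

# 2) Text -> text: the "thinking" step works on a plain transcript
reply = client.responses.create(
    model="gpt-4o",
    input=f"Answer concisely: {transcript.text}",
)

# 3) Text -> speech
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=reply.output_text,
)
speech.write_to_file("answer.mp3")
```

Every hop hands over plain text, which is exactly what makes this variant easy to log, test, and debug.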
We set up a VoicePipeline with a custom workflow. This workflow runs an Agent, but it can also trigger special responses if you say a secret word.
Here’s what happens when you speak:

1. Your recorded audio is passed into the pipeline.
2. Speech-to-text transcribes it into text.
3. The transcription goes to the workflow, which either runs the Agent or triggers the special secret-word response.
4. The reply is converted back to speech and streamed out to you.
It’s real-time, interactive, and smart enough to react differently if you slip in a hidden phrase.
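A trimmed sketch of such a workflow might look like this (the agent and secret word are up to you; the streaming helpers shown are part of the SDK’s voice module):

```python
from collections.abc import AsyncIterator

from agents import Agent, Runner
from agents.voice import VoiceWorkflowBase, VoiceWorkflowHelper


class SecretWordWorkflow(VoiceWorkflowBase):
    """Runs an Agent, but short-circuits when the secret word is heard."""

    def __init__(self, agent: Agent, secret_word: str) -> None:
        self._agent = agent
        self._secret_word = secret_word.lower()
        self._history: list = []

    async def run(self, transcription: str) -> AsyncIterator[str]:
        self._history.append({"role": "user", "content": transcription})

        # Special-case the hidden phrase before the agent ever sees it.
        if self._secret_word in transcription.lower():
            yield "You guessed the secret word!"
            return

        # Otherwise run the agent and stream its text back to the pipeline,
        # which converts each chunk to speech.
        result = Runner.run_streamed(self._agent, self._history)
        async for chunk in VoiceWorkflowHelper.stream_text_from(result):
            yield chunk
        self._history = result.to_input_list()
```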
When setting up a voice pipeline, there are a few key components you can customize:

- The workflow that runs each time new audio is transcribed.
- The speech-to-text and text-to-speech models used.
- The pipeline config, covering things like the model provider, tracing, and TTS voice settings.
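Put together, a customized pipeline might be declared like this (a sketch; the parameter and settings names mirror the SDK’s voice module, and the voice choice is just an example):

```python
from agents.voice import TTSModelSettings, VoicePipeline, VoicePipelineConfig

pipeline = VoicePipeline(
    workflow=my_workflow,             # your agentic workflow
    stt_model="gpt-4o-transcribe",    # speech-to-text model
    tts_model="gpt-4o-mini-tts",      # text-to-speech model
    config=VoicePipelineConfig(
        tts_settings=TTSModelSettings(voice="coral"),
    ),
)
```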
To kick off a voice pipeline, you’ll use the run() method. It accepts audio input in one of two forms, depending on how you’re handling speech:

- AudioInput, for a complete, pre-recorded audio buffer (push-to-talk style).
- StreamedAudioInput, for audio you push chunk by chunk while the user is still speaking.
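Inside an async function (with pipeline defined as before, and buffer/chunk standing in for your 16-bit PCM numpy arrays), the difference is just which input class you pass:

```python
from agents.voice import AudioInput, StreamedAudioInput

# Form 1: you already have the full recording (push-to-talk style).
result = await pipeline.run(AudioInput(buffer=buffer))

# Form 2: audio arrives live; push chunks as they are captured.
stream_input = StreamedAudioInput()
result = await pipeline.run(stream_input)
await stream_input.add_audio(chunk)  # call repeatedly while recording
```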
Once your pipeline is running, it returns a StreamedAudioResult, which lets you stream events in real time as the interaction unfolds. These events come in a few flavors:

- VoiceStreamEventAudio: a chunk of synthesized audio, ready to play.
- VoiceStreamEventLifecycle: lifecycle markers, such as a turn starting or ending.
- VoiceStreamEventError: something went wrong mid-stream.
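A typical consumer loop just switches on event.type, along these lines (player here is whatever audio sink you’re using):

```python
async for event in result.stream():
    if event.type == "voice_stream_event_audio":
        player.add_audio(event.data)        # a chunk of synthesized PCM audio
    elif event.type == "voice_stream_event_lifecycle":
        print(f"lifecycle: {event.event}")  # e.g. turn_started / turn_ended
    elif event.type == "voice_stream_event_error":
        print(f"error: {event.error}")      # surface failures instead of hanging
```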
Now let’s get hands-on and build a multilingual Voice Agent with the OpenAI Agent SDK, step by step. First, create a project folder:
mkdir my_project
cd my_project
Create the environment:
python -m venv .venv
Activate it:
source .venv/bin/activate
Install the Agent SDK with voice support:

pip install 'openai-agents[voice]'

Set your OpenAI API key:

export OPENAI_API_KEY=sk-...
Next, clone the official examples repo:

git clone https://github.com/openai/openai-agents-python.git
Navigate to the example file:
cd openai-agents-python/examples/voice/static
Now, edit main.py. You’ll do two key things: add a Hindi agent alongside the existing Spanish one so the assistant can hand off between languages, and save the streamed audio response to a WAV file so you can replay it later. Replace the entire content of main.py with this final code:
import asyncio
import os
import random
import time
import wave

from agents import Agent, function_tool
from agents.extensions.handoff_prompt import prompt_with_handoff_instructions
from agents.voice import (
    AudioInput,
    SingleAgentVoiceWorkflow,
    SingleAgentWorkflowCallbacks,
    VoicePipeline,
)

from .util import AudioPlayer, record_audio


@function_tool
def get_weather(city: str) -> str:
    """Return a (random) weather report for the given city."""
    print(f"[debug] get_weather called with city: {city}")
    choices = ["sunny", "cloudy", "rainy", "snowy"]
    return f"The weather in {city} is {random.choice(choices)}."


spanish_agent = Agent(
    name="Spanish",
    handoff_description="A Spanish-speaking agent.",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. Speak in Spanish.",
    ),
    model="gpt-4o-mini",
)

hindi_agent = Agent(
    name="Hindi",
    handoff_description="A Hindi-speaking agent.",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. Speak in Hindi.",
    ),
    model="gpt-4o-mini",
)

agent = Agent(
    name="Assistant",
    instructions=prompt_with_handoff_instructions(
        "You're speaking to a human, so be polite and concise. If the user speaks in Spanish, handoff to the spanish agent. If the user speaks in Hindi, handoff to the hindi agent.",
    ),
    model="gpt-4o-mini",
    handoffs=[spanish_agent, hindi_agent],
    tools=[get_weather],
)


class WorkflowCallbacks(SingleAgentWorkflowCallbacks):
    def on_run(self, workflow: SingleAgentVoiceWorkflow, transcription: str) -> None:
        print(f"[debug] on_run called with transcription: {transcription}")


async def main():
    pipeline = VoicePipeline(
        workflow=SingleAgentVoiceWorkflow(agent, callbacks=WorkflowCallbacks())
    )

    audio_input = AudioInput(buffer=record_audio())

    result = await pipeline.run(audio_input)

    # Collect every audio chunk so the full reply can be saved afterwards
    all_audio_chunks = []

    with AudioPlayer() as player:
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                audio_data = event.data
                player.add_audio(audio_data)
                all_audio_chunks.append(audio_data)
                print("Received audio")
            elif event.type == "voice_stream_event_lifecycle":
                print(f"Received lifecycle event: {event.event}")

    # Save the combined audio to a file
    if all_audio_chunks:
        os.makedirs("output", exist_ok=True)
        filename = f"output/response_{int(time.time())}.wav"
        with wave.open(filename, "wb") as wf:
            wf.setnchannels(1)      # mono
            wf.setsampwidth(2)      # 16-bit PCM
            wf.setframerate(24000)  # the pipeline's TTS streams 24 kHz audio
            wf.writeframes(b"".join(all_audio_chunks))
        print(f"Audio saved to {filename}")


if __name__ == "__main__":
    asyncio.run(main())
Make sure you’re in the correct directory:
cd openai-agents-python
Then, launch it:
python -m examples.voice.static.main
I asked the agent two things, one in English and one in Hindi:
Here’s the terminal:
English Response:
Hindi Response:
Want to dig deeper? Check these out:
Also read: OpenAI’s Audio Models: How to Access, Features, Applications, and More
Building a voice agent with the OpenAI Agent SDK is way more accessible now—you don’t need to stitch together a ton of tools anymore. Just pick the right architecture, set up your VoicePipeline, and let the SDK do the heavy lifting.
If you’re going for high-quality conversational flow, go multimodal. If you want structure and control, go chained. Either way, this tech is powerful, and it’s only going to get better. If you’re building one, let me know in the comments below.