Content
summary Summary

OpenAI has launched its "realtime API" for production, moving it out of beta.

Ad

The API targets companies and developers building voice assistants for real-world applications like customer support, education, or personal productivity. Its main component, the "gpt-realtime" model, generates and processes speech directly, skipping the usual text conversion. It responds faster, sounds more natural, and handles complex instructions better than previous versions, according to OpenAI.

The company says gpt-realtime can now pick up on nonverbal cues like laughter, switch languages mid-sentence, and adjust its tone - for example, speaking "friendly with a French accent" or "fast and professional." The model also features two new voices, Cedar and Marin, along with improvements to the existing voices.

Video: OpenAI

Ad
Ad

On benchmarks, gpt-realtime reaches 82.8 percent accuracy on Big Bench Audio (up from 65.6 percent), 30.5 percent on MultiChallenge (up from 20.6 percent), and 66.5 percent on ComplexFuncBench (up from 49.7 percent).

Better tool integration and image input

The API now streamlines tool integration. OpenAI says the model is better at picking the right tool, triggering it at the right moment, and using the right arguments, making function calls more dependable. Developers can connect external tools and services through SIP and remote MCP servers. Reusable prompts allow for saving configurations and tool settings for different use cases.

Image input is now supported. Users can send screenshots or photos in a conversation, and the model can reference them—for example, to read text from an image or answer questions about what's shown. Developers control what the model can see.

New options let developers set token limits and trim multi-turn conversations, which helps control costs for longer sessions. Pricing for gpt-realtime is now 20 percent lower: $32 per million audio input tokens and $64 per million output tokens. Cached input tokens are $0.40 per million.

OpenAI says the API can detect problematic content and end conversations that break its policies, but the history of language model security suggests this shouldn't be the only safeguard. Developers can add their own safety requirements. For EU users, there are options for storing data within the EU and special privacy rules for businesses.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Recommendation
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • OpenAI has officially launched its "realtime API", aimed at enabling companies and developers to integrate voice assistants into practical, everyday applications.
  • The new "gpt-realtime" model can process and generate speech instantly, responds more quickly, produces more natural-sounding output, and can recognize both non-verbal signals and language changes, outperforming the earlier beta version.
  • The API now includes support for image input, better tool integration through asynchronous calls, and new control options such as token limits and the ability to truncate multiple sessions in parallel.
Sources
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.