OpenAI has released a new generation of audio models that let developers customize how their AI assistants speak. The update includes improved speech recognition and the ability to control an AI's speaking style through simple text commands.
According to OpenAI, its new gpt-4o-transcribe and gpt-4o-mini-transcribe models show lower error rates than the earlier Whisper models when converting speech to text. The company says the new models hold up better in challenging conditions such as heavy accents, noisy environments, and varying speech speeds.
The most notable feature comes from the new gpt-4o-mini-tts text-to-speech model. The system responds to style instructions like "speak like a pirate" or "tell this as a bedtime story," allowing developers to fine-tune how their AI voices communicate. These capabilities are built on OpenAI's GPT-4o and GPT-4o-mini architectures, which handle multiple types of media input and output.
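The sketch below illustrates how that style steering might look through the OpenAI Python SDK's speech endpoint. It is a minimal example, not a verified recipe: the `instructions` parameter and the `coral` voice are assumptions drawn from OpenAI's announcement, and the output path and spoken text are invented for illustration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Request a specific delivery style with a plain-text instruction; the
# `instructions` field is assumed to carry the style prompt alongside the
# text to be spoken.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of OpenAI's preset voices
    input="Once upon a time, a small robot learned to sing.",
    instructions="Tell this as a gentle bedtime story, slow and soothing.",
) as response:
    response.stream_to_file("bedtime_story.mp3")
```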
According to OpenAI, the improved performance comes from pretraining on specialized audio datasets for more nuanced speech understanding, more efficient model distillation techniques, and expanded use of reinforcement learning in speech recognition. The company also implemented "self-play" methods to simulate natural conversation patterns.
Developer access and limitations
Developers can now access these models through OpenAI's API and integrate them using the Agents SDK. For real-time applications, OpenAI recommends its Realtime API, which offers speech-to-speech capabilities.
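For speech-to-text, the call pattern is expected to mirror the OpenAI Python SDK's existing transcription endpoint, with only the model name changing. A minimal sketch, assuming a local recording named meeting.wav:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local recording with the new speech-to-text model;
# gpt-4o-mini-transcribe can be substituted where cost matters more than accuracy.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```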
For now, the system works only with OpenAI's preset synthetic voices; developers can't create new voices or clone existing ones. The company says it plans to allow custom voices in the future while maintaining safety standards, and aims to expand into video for multimodal experiences.
This update follows OpenAI's March 2024 introduction of Voice Engine, which was limited to its own products and select customers. That earlier model appears to have been superseded by GPT-4o's broader multimodal capabilities.