OpenAI has released a new generation of audio models that let developers customize how their AI assistants speak. The update includes improved speech recognition and the ability to control an AI's speaking style through simple text commands.
According to OpenAI, its new gpt-4o-transcribe and gpt-4o-mini-transcribe models show lower error rates than the earlier Whisper models when converting speech to text. The company says the new models hold up better in challenging conditions such as heavy accents, noisy environments, and varying speech speeds.
The most notable feature comes from the new gpt-4o-mini-tts text-to-speech model. The system responds to style instructions like "speak like a pirate" or "tell this as a bedtime story," allowing developers to fine-tune how their AI voices communicate. These capabilities are built on OpenAI's GPT-4o and GPT-4o-mini architectures, which handle multiple types of media input and output.
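The sketch below illustrates how that style steering might look through the OpenAI Python SDK's speech endpoint. It is a minimal example, not a verified recipe: the `instructions` parameter and the `coral` voice are assumptions drawn from OpenAI's announcement, and the output path and spoken text are invented for illustration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Request a specific delivery style with a plain-text instruction; the
# `instructions` field is assumed to carry the style prompt alongside the
# text to be spoken.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of OpenAI's preset voices
    input="Once upon a time, a small robot learned to sing.",
    instructions="Tell this as a gentle bedtime story, slow and soothing.",
) as response:
    response.stream_to_file("bedtime_story.mp3")
```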
According to OpenAI, the improved performance comes from pretraining on specialized audio datasets for more nuanced speech understanding, more efficient model distillation techniques, and expanded use of reinforcement learning in speech recognition. The company also implemented "self-play" methods to simulate natural conversation patterns.
Developer access and limitations
Developers can now access these models through OpenAI's API and integrate them using the Agents SDK. For real-time applications, OpenAI recommends its Realtime API, which offers speech-to-speech capabilities.
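For speech-to-text, the call pattern is expected to mirror the OpenAI Python SDK's existing transcription endpoint, with only the model name changing. A minimal sketch, assuming a local recording named meeting.wav:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local recording with the new speech-to-text model;
# gpt-4o-mini-transcribe can be substituted where cost matters more than accuracy.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```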
For now, the system works only with OpenAI's preset synthetic voices; developers can't create new voices or clone existing ones. The company says it plans to allow custom voices in the future while maintaining safety standards, and aims to expand into video for multimodal experiences.
This update follows OpenAI's March 2024 introduction of Voice Engine, which was limited to its own products and select customers. That earlier model appears to have been superseded by GPT-4o's broader multimodal capabilities.