A study from Cornell University shows that OpenAI's Whisper speech recognition system invents content that was never spoken in about 1.4 percent of the audio recordings it transcribes.
While 1.4 percent might seem small, the impact scales with the software's widespread use: at that rate, one million transcriptions would contain roughly 14,000 with fabricated passages.
And it's not just a matter of quantity—the quality of those fabrications is troubling, too. The researchers found that 38 percent of the made-up content contained problematic material, from depictions of violence to incorrect attributions and misleading claims of authority.
The trouble with pauses
The study points to longer pauses in speech as a key culprit. When there's a gap in the audio, Whisper tries to fill in the blanks based on its general language knowledge—and that's where things can go awry.
This problem affects people with speech disorders like aphasia more than others, as they tend to pause more often: their hallucination rate was 1.7 percent, compared to 1.2 percent for the control group. It's a reminder that AI's biases and blind spots can have very real consequences for marginalized communities.
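For readers who run Whisper themselves, the open-source whisper Python package exposes a few decoding options that practitioners often adjust to reduce hallucinations on silent stretches. The snippet below is a minimal sketch assuming that package and a hypothetical local audio file; it is an illustration of common mitigation settings, not a fix endorsed by the study.

```python
import whisper  # open-source openai-whisper package

model = whisper.load_model("base")

# Options commonly tuned to curb hallucinations during pauses:
# - condition_on_previous_text=False keeps the decoder from carrying earlier
#   text into a silent stretch and "continuing the story" there.
# - no_speech_threshold and logprob_threshold let Whisper drop segments it
#   judges to be silence or low-confidence instead of inventing words for them.
result = model.transcribe(
    "interview.wav",  # hypothetical input file
    condition_on_previous_text=False,
    no_speech_threshold=0.6,
    logprob_threshold=-1.0,
)

for segment in result["segments"]:
    print(f"[{segment['start']:7.2f}s - {segment['end']:7.2f}s] {segment['text']}")
```

Settings like these can reduce, but not eliminate, invented text during long pauses, which is why the transcripts still need human review.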
Other researchers have documented similar problems, according to the AP. A University of Michigan researcher found fabricated content in eight out of 10 transcripts, a machine learning engineer found errors in about half of the more than 100 hours of recordings he analyzed, and another developer reported errors in nearly all of his 26,000 transcripts.
OpenAI acknowledges these limitations and advises against using Whisper in "high-risk domains like decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes."
The latest Whisper v3 model also suffers from hallucinations. OpenAI believes that these occur "because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself."
Hallucinations are a universal generative AI feature
The fact that audio-based AI tools like Whisper hallucinate just as text models like ChatGPT do is well documented, but apparently not everyone is aware of it, as the recent hype around AI-generated podcasts shows.
Tools like NotebookLM's Audio Overviews can spin brief topics into lengthy discussions, creating even more room for error. And with natural-sounding AI voices, it's all too easy to take the output at face value.
This doesn't mean that AI podcasts are inherently bad or useless. They can be useful for creating educational content, for example, as long as the material can be thoroughly reviewed. But relying on them to learn new information without verification is a bad idea.
The key takeaway is that human oversight is critical for any type of AI-generated content, whether it's text, transcripts, or podcasts. We need experts who understand the subject to review and validate the output, because with current technology, blindly trusting AI-generated content is a surefire way to let errors slip through the cracks.