
A University College London study found that humans can detect deepfake speech only 73% of the time, with similar accuracy for English and Mandarin speakers.

Using a text-to-speech algorithm, the researchers generated 50 deepfake speech samples in each language and played them to 529 participants.

About 27 percent of the time, listeners mistook the deepfake speech for real. A familiarization treatment increased recognition accuracy by an average of only 3.84 percent, and neither listening to the clips several times nor listening to shorter clips helped.

That 27 percent means, for example, that roughly one in four phone scams could be successful. Other factors are at play, of course: a voice you already know is probably easier to recognize as fake, but all the more dangerous if you don't.


Recognizing deepfake speech will only get harder

The researchers expect deepfake speech to become more realistic in the future, making it even harder to detect. They didn't even use the latest technology for their study.

"The difficulty of detecting speech deepfakes confirms their potential for misuse and signals that defenses against this threat are needed."

The study, with the very direct title "Warning: Humans cannot reliably detect speech deepfakes," raises concerns about the ability to consistently detect deepfake audio, even with training: "Our results suggest the need for automated detectors to mitigate a human listener's weaknesses."

While automated deepfake detectors have limitations, improving these systems is essential to mitigate the potential threats posed by deepfake content, says lead author Kimberly Mai. Compared to deepfake video, deepfake audio offers even fewer cues that could give it away as fake.

"We show that even in a controlled environment where the task is easier (participants are aware of the presence of speech deepfakes and the deepfakes are not created using state-of-the-art speech synthesizers), deepfake detection is not high."

Interestingly, the researchers found that people who correctly classified real utterances as legitimate and those who incorrectly classified them as fake pointed to the same features, such as pauses and tone. Gut feeling, in other words, played a key role in the decision-making process: participants often used words like "naturalness" or "robotic" to explain their choices.

Combining multiple human judgments, referred to in the study as crowd performance, was on par with the best automated detectors overall and was less likely to fail when conditions changed.
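To illustrate the idea behind crowd performance, here is a minimal sketch of aggregating several listeners' verdicts on a clip by majority vote. This is an illustration only, not the study's actual aggregation method; the function name, the tie-breaking rule, and the example data are hypothetical.

```python
from collections import Counter

def crowd_verdict(judgments: list[str]) -> str:
    """Combine individual listener verdicts ("real" or "fake")
    on one clip into a single crowd decision by majority vote."""
    counts = Counter(judgments)
    # Ties go to "fake" as a cautious default -- an assumption of
    # this sketch, not a rule specified by the study.
    return "fake" if counts["fake"] >= counts["real"] else "real"

# Hypothetical example: five listeners judge the same deepfake clip.
listeners = ["real", "fake", "fake", "real", "fake"]
print(crowd_verdict(listeners))  # -> fake
```

The intuition is that different listeners tend to make different mistakes, so pooling their verdicts cancels out some individual error, which would explain why the crowd held up better than single listeners when conditions changed.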

Summary
  • A University College London study shows that humans can detect deepfake speech accurately only 73% of the time, with no significant difference between English and Mandarin speakers.
  • Familiarization and repeated listening had minimal impact on participants' detection accuracy, raising concerns about the potential for misuse of deepfake speech in scams and misinformation.
  • The researchers emphasize the need for improved automated deepfake detectors to mitigate the increasing difficulty of detecting speech manipulation, as human intuition and judgment remain insufficient.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.