A University College London study found that humans can detect deepfake speech only 73 percent of the time, with equal accuracy for English and Mandarin speakers.
Using a text-to-speech algorithm, the researchers generated 50 deepfake speech samples in each language and played them to 529 participants.
About 27 percent of the time, listeners thought the deepfake speech was real. A familiarization treatment increased recognition accuracy by an average of only 3.84 percent, and neither listening to the clips several times nor listening to shorter clips helped.
This means, for example, that roughly one in four phone scams could succeed, although other factors are also at play. A voice you already know, for instance, is probably easier to recognize as fake, but all the more dangerous if you fail to do so.
Recognizing deepfake speech will only get harder
The researchers expect deepfake speech to improve and become more realistic, making it even harder to detect. They did not even use the latest technology for their study.
"The difficulty of detecting speech deepfakes confirms their potential for misuse and signals that defenses against this threat are needed."
The study, with the very direct title "Warning: Humans cannot reliably detect speech deepfakes," raises concerns about the ability to consistently detect deepfake audio, even with training: "Our results suggest the need for automated detectors to mitigate a human listener's weaknesses."
While automated deepfake detectors have limitations, improving these systems is essential to mitigate the potential threats posed by deepfake content, says lead author Kimberly Mai. Compared to deepfake videos, deepfake audio has even fewer clues to identify it as fake.
"We show that even in a controlled environment where the task is easier (participants are aware of the presence of speech deepfakes and the deepfakes are not created using state-of-the-art speech synthesizers), deepfake detection is not high."
One interesting finding was that participants who correctly classified real utterances as legitimate and those who incorrectly classified them as fake tended to point to the same features, such as pauses and tone. People's gut feeling or intuition thus played a key role in their decisions, and participants often used words like "naturalness" or "robotic" to explain their choices.
In terms of overall performance, the combination of multiple human judgments, referred to in the study as crowd performance, was on par with the best automated detectors and was less likely to fail when conditions changed.
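To illustrate why pooling judgments can help, here is a minimal sketch in Python. It assumes the crowd decision is a simple majority vote and that listeners err independently of one another. Both are simplifications for illustration, since the article does not say how the study combined judgments, and real listeners are likely to make correlated mistakes. The function names are hypothetical; the only figure taken from the study is the 73 percent individual accuracy.

```python
from collections import Counter
from math import comb


def majority_vote(labels):
    """Combine individual 'real'/'fake' judgments into one crowd label.

    Ties resolve to 'fake' (a cautious choice made for this sketch).
    """
    counts = Counter(labels)
    return "fake" if counts["fake"] >= counts["real"] else "real"


def crowd_accuracy(p, n):
    """Chance that a strict majority of n listeners is correct.

    Assumes each listener is right with probability p, independently
    of the others (an idealization), and that n is odd so no ties occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )


if __name__ == "__main__":
    # 0.73 is the individual detection rate reported in the article.
    for n in (1, 3, 5, 11):
        print(f"{n:>2} listeners: {crowd_accuracy(0.73, n):.3f}")
```

Under these assumptions, accuracy rises as more listeners are pooled. In practice the gain would be smaller, because listeners who rely on the same cues tend to be fooled by the same clips.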