
In a new study, human participants were unable to reliably distinguish whether they were chatting with a human or GPT-4. The results of the study raise new questions about the classic Turing test and its implications.

74 years after Alan Turing proposed his famous thought experiment comparing humans and machines, researchers have now conducted a randomized, controlled two-player variant of the Turing test with GPT-4. The result: In chats, human participants couldn't reliably recognize whether they were communicating with a human or the AI.

Before the test, the language models were set up with a detailed prompt that shaped their responses - suppressing typical AI behaviors and, for example, making the models occasionally produce spelling mistakes.

A selection of interviews between human interviewers (green) and witnesses (gray). One of the four interviews was conducted with a human witness, the rest with an AI. Who is who? The solution is at the end of the text. | Image: Jones, Bergen et al.

GPT-4 was judged to be human in 54 percent of cases - meaning that after a five-minute conversation, roughly half of the participants thought the model was human. That put GPT-4 ahead of the older GPT-3.5 model (50 percent) and the simple, rule-based ELIZA reference system from the 1960s (22 percent). Only human witnesses achieved a higher rate, at 67 percent - meaning participants rated real humans as AI in a third of cases. Modern AI systems are thus able to credibly pass as human. And once people know that AI can behave in a human-like manner, they set their "human" bar so high that they repeatedly misclassify real humans.


Intelligence or not - language models can be deceptive

The researchers defined a 50 percent success rate as passing the Turing test, since at that point participants can no longer distinguish human from machine better than chance. By this definition, GPT-4 passed the test. And participants weren't simply guessing: when they judged GPT-4 to be human, they did so with an average confidence of 73 percent.
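As a rough illustration of this chance-level criterion (not the paper's actual analysis - the sample size below is hypothetical), a normal-approximation z-score shows how an observed pass rate is compared against the 50 percent baseline:

```python
import math

def z_vs_chance(successes: int, trials: int, p0: float = 0.5) -> float:
    """Normal-approximation z-score of an observed rate against a chance baseline p0."""
    p_hat = successes / trials
    se = math.sqrt(p0 * (1 - p0) / trials)  # standard error under the null hypothesis
    return (p_hat - p0) / se

# Hypothetical numbers: 54 percent "human" verdicts over 500 five-minute chats.
z = z_vs_chance(270, 500)
print(round(z, 2))  # 1.79
```

Under these made-up numbers, |z| stays below 1.96, so a 54 percent rate over 500 trials would not be statistically distinguishable from chance at the usual 5 percent level - which is the sense in which a rate near 50 percent counts as "passing."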

Image: Jones, Bergen et al.

The Turing test is often described as a benchmark for machine intelligence. This framing has drawn plenty of criticism: the test is said to be too easy or too hard, and it doesn't really measure intelligence. The study results now provide empirical evidence of what the Turing test probably actually measures: participants' strategies and rationales focused more on language style and socio-emotional factors than on knowledge and logic.

However, participants who asked about logic, current events, or human emotions and experiences were right more often on average.

Image: Jones, Bergen et al.

"The results here likely set a lower bound on the potential for deception in more naturalistic contexts where, unlike the experimental setting, people may not be alert to the possibility of deception or exclusively focused on detecting it," the paper states. Systems that can reliably mimic humans could have far-reaching economic and social impacts, for example, by taking over customer contacts previously reserved for human employees.

But they could also mislead the public or their own human operators and undermine social trust in authentic human interactions, the scientists say.


The solution: A) is GPT-4, B) is human, C) is GPT-3.5, D) is ELIZA.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Researchers have revisited the Turing Test with GPT-4. Human participants in chats were unable to reliably identify whether they were communicating with a human or the AI.
  • GPT-4 was judged to be human in 54 percent of cases, meaning that roughly half of the participants thought the model was human after a five-minute conversation. This was better than GPT-3.5 (50 percent) and the rule-based system ELIZA (22 percent), but worse than real people (67 percent).
  • The results suggest that people are more likely to be guided by linguistic style and social-emotional factors than by knowledge and logic in the Turing test. Systems that can reliably mimic humans could have far-reaching economic and social consequences, such as taking over customer interactions or misleading the public.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.