
In a new study, human participants were unable to reliably distinguish whether they were chatting with a human or GPT-4. The results of the study raise new questions about the classic Turing test and its implications.

74 years after Alan Turing proposed his famous thought experiment comparing humans and machines, researchers have now conducted a randomized, controlled two-player variant of the Turing test with GPT-4. The result: In chats, human participants couldn't reliably recognize whether they were communicating with a human or the AI.

Before the test, the language models were set up with a detailed prompt that shaped their responses - suppressing typical AI behaviors and, for example, making the models occasionally produce spelling mistakes.

A selection of interviews between human interviewers (green) and witnesses (gray). One of the four interviews was conducted with a human witness, the rest with an AI. Who is who? The solution is at the end of the text. | Image: Jones, Bergen et al.

GPT-4 was judged to be human in 54 percent of cases - meaning that after a five-minute conversation, roughly half of the participants thought the model was human. That put GPT-4 ahead of the older GPT-3.5 model (50 percent) and the simple, rule-based ELIZA reference system from the 1960s (22 percent). Only human witnesses achieved a higher rate, at 67 percent - meaning participants rated real humans as AI in a third of cases. Modern AI systems are thus able to credibly pass as human. And once people know that AI can behave in a human-like manner, they set their "human" bar so high that they repeatedly misclassify real humans.


Intelligence or not - language models can be deceptive

The researchers defined a 50 percent success rate as passing the Turing test, since at that point participants can no longer distinguish human from machine better than chance. By this definition, GPT-4 passed the test. And participants weren't simply guessing: when they judged GPT-4 to be human, they did so with an average confidence of 73 percent.
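As a rough illustration of this chance-level criterion (not the paper's actual analysis - the sample size below is hypothetical), a normal-approximation z-score shows how an observed pass rate is compared against the 50 percent baseline:

```python
import math

def z_vs_chance(successes: int, trials: int, p0: float = 0.5) -> float:
    """Normal-approximation z-score of an observed rate against a chance baseline p0."""
    p_hat = successes / trials
    se = math.sqrt(p0 * (1 - p0) / trials)  # standard error under the null hypothesis
    return (p_hat - p0) / se

# Hypothetical numbers: 54 percent "human" verdicts over 500 five-minute chats.
z = z_vs_chance(270, 500)
print(round(z, 2))  # 1.79
```

Under these made-up numbers, |z| stays below 1.96, so a 54 percent rate over 500 trials would not be statistically distinguishable from chance at the usual 5 percent level - which is the sense in which a rate near 50 percent counts as "passing."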

Image: Jones, Bergen et al.

The Turing test is often described as a benchmark for machine intelligence. This framing has drawn plenty of criticism: the test is said to be too easy or too hard, and it doesn't really measure intelligence. The study results now provide empirical evidence of what the Turing test probably actually measures: participants' strategies and rationales focused more on language style and socio-emotional factors than on knowledge and logic.

However, participants who asked about logic, current events, or human emotions and experiences were right more often on average.

Image: Jones, Bergen et al.

"The results here likely set a lower bound on the potential for deception in more naturalistic contexts where, unlike the experimental setting, people may not be alert to the possibility of deception or exclusively focused on detecting it," the paper states. Systems that can reliably mimic humans could have far-reaching economic and social impacts, for example, by taking over customer contacts previously reserved for human employees.

But they could also mislead the public or their own human operators and undermine social trust in authentic human interactions, the scientists say.


The solution: A) is GPT-4, B) is human, C) is GPT-3.5, D) is ELIZA.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Researchers have revisited the Turing Test with GPT-4. Human participants in chats were unable to reliably identify whether they were communicating with a human or the AI.
  • GPT-4 was judged to be human in 54 percent of cases, meaning that roughly half of the participants thought the model was human after a five-minute conversation. This was better than GPT-3.5 (50 percent) and the rule-based system ELIZA (22 percent), but worse than real people (67 percent).
  • The results suggest that people are more likely to be guided by linguistic style and social-emotional factors than by knowledge and logic in the Turing test. Systems that can reliably mimic humans could have far-reaching economic and social consequences, such as taking over customer interactions or misleading the public.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.