LLM comparison: GPT-4, Claude 2 and Llama 2 - which is hallucinating, which is hedging?


The LLM benchmarking service "Arthur" compared how large language models such as GPT-4 perform on two important dimensions: hallucination and hedging.

Arthur analyzed the hallucinations and hedging (answers qualified with caveats) of OpenAI's GPT-3.5 (~175 billion parameters) and GPT-4 (~1.76 trillion parameters) language models, Anthropic's Claude 2 (parameters unknown), Meta's Llama 2 (70 billion parameters), and Cohere's Command model (~50 billion parameters).

To compare the hallucinations, Arthur asked questions about combinatorics and probability, U.S. presidents, and political leaders in Morocco. The questions were asked several times because the LLMs sometimes gave the right answer, sometimes a slightly wrong answer, and sometimes an entirely wrong answer to the same question.
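The repeated-question setup can be illustrated with a small tally. This is a minimal sketch, not Arthur's actual tooling: the label names, the `tally_outcomes` and `outcome_shares` helpers, and the input format are all assumptions made for illustration, and it presumes each response has already been labeled by a reviewer.

```python
from collections import Counter

# Hypothetical labels (assumption): the article distinguishes correct answers,
# wrong answers (hallucinations), and refusals to answer.
LABELS = ("correct", "hallucination", "refusal")


def tally_outcomes(labeled_runs):
    """Count outcome labels per question across repeated runs.

    labeled_runs: iterable of (question_id, label) pairs, one pair per run.
    Returns a dict mapping question_id to a Counter of labels.
    """
    tallies = {}
    for question_id, label in labeled_runs:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        tallies.setdefault(question_id, Counter())[label] += 1
    return tallies


def outcome_shares(tallies):
    """Aggregate per-question tallies into an overall share per label."""
    total = Counter()
    for counts in tallies.values():
        total.update(counts)
    n = sum(total.values())
    return {label: (total[label] / n if n else 0.0) for label in LABELS}


# The same question asked three times can yield different outcomes.
runs = [
    ("q1", "correct"),
    ("q1", "hallucination"),
    ("q1", "correct"),
    ("q2", "refusal"),
]
tallies = tally_outcomes(runs)
shares = outcome_shares(tallies)
```

Keeping per-question counts separate from the overall shares makes it visible when a model is inconsistent on a single question rather than uniformly right or wrong.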

On the questions about U.S. presidents, Claude 2 had the fewest hallucinations and more correct answers than GPT-4, and it performed significantly better than GPT-3.5 Turbo, which consistently failed. GPT-3.5's weak showing is critical because the free version of ChatGPT is based on it and is probably the version most widely used by students and in schools.

Meta's Llama 2 and Claude 2 were particularly likely to refuse to answer questions about Moroccan politicians, probably as a countermeasure against excessive hallucinations. GPT-4 was the only model with more correct answers than hallucinations in this test.

GPT-4 is more cautious than other models

In a second test, the benchmarking platform looked at the extent to which models hedge their answers, that is, preface them with a caveat such as "As a large language model, I cannot …". This "hedging" can frustrate users and sometimes turns up in AI-generated texts published by careless "authors."

For the hedging test, the platform used a dataset of generic questions that users might ask LLMs. The two GPT-4 models hedged 3.3 and 2.9 percent of the time, respectively. GPT-3.5 Turbo and Claude 2 did so only about two percent of the time, while Cohere's model did not hedge at all.
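A hedging rate like the percentages above could be computed along these lines. This is only a sketch under stated assumptions: the article does not say how Arthur detected hedged answers, so the simple phrase matching, the `HEDGE_MARKERS` list, and the function names here are hypothetical.

```python
# Hypothetical marker phrases (assumption); a real evaluation would likely
# need a broader list or a classifier.
HEDGE_MARKERS = (
    "as a large language model",
    "as an ai language model",
)


def is_hedged(response, markers=HEDGE_MARKERS):
    """Return True if the response contains any hedging marker phrase."""
    text = response.lower()
    return any(marker in text for marker in markers)


def hedging_rate(responses):
    """Return the percentage of responses that contain a hedge."""
    responses = list(responses)
    if not responses:
        return 0.0
    hedged = sum(is_hedged(r) for r in responses)
    return 100.0 * hedged / len(responses)


sample = [
    "As a large language model, I cannot give medical advice.",
    "The capital of France is Paris.",
    "You can sort the list with sorted().",
    "Water boils at 100 degrees Celsius at sea level.",
]
rate = hedging_rate(sample)
```

Phrase matching is easy to audit but will miss hedges worded differently, so the resulting rate should be read as a lower bound.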
