The LLM benchmarking service Arthur compared the performance of large language models such as GPT-4 on two important questions: how often they hallucinate and how often they hedge their answers.

Arthur analyzed the hallucinations and hedged responses of OpenAI's GPT-3.5 (~175 billion parameters) and GPT-4 (~1.76 trillion parameters), Anthropic's Claude 2 (parameter count unknown), Meta's Llama 2 (70 billion parameters), and Cohere's Command model (~50 billion parameters).

To compare hallucinations, Arthur asked the models questions about combinatorics and probability, U.S. presidents, and political leaders in Morocco. Each question was asked several times, because the LLMs sometimes gave the right answer, sometimes a slightly wrong answer, and sometimes an entirely wrong answer to the same question.
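
This repeated-query protocol is easy to reproduce in principle. The sketch below is only an illustration: the `ask_model` callable, the two sample questions, and the crude string-matching grader are assumptions for this example and are not Arthur's published setup.

```python
from collections import Counter

# Illustrative question set with reference answers; Arthur's actual
# benchmark questions and grading rules are not published here.
QUESTIONS = {
    "How many ways can 5 books be arranged on a shelf?": "120",
    "Who was the 16th president of the United States?": "Abraham Lincoln",
}

def grade(answer: str, reference: str) -> str:
    """Very rough grading: a response containing the reference answer
    counts as correct, anything else is treated as a hallucination."""
    return "correct" if reference.lower() in answer.lower() else "hallucination"

def benchmark(ask_model, runs: int = 5) -> Counter:
    """Ask every question several times, since the same model can answer
    the same question correctly on one run and incorrectly on the next."""
    tally = Counter()
    for question, reference in QUESTIONS.items():
        for _ in range(runs):
            tally[grade(ask_model(question), reference)] += 1
    return tally

# Example with a stand-in "model" that always gives the same reply:
print(benchmark(lambda question: "The answer is 120."))
```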

Image: arthur.ai

On questions about U.S. presidents, Claude 2 had the fewest hallucinations and the most correct answers, performing better than GPT-4 and significantly better than GPT-3.5 Turbo, which failed consistently. The latter is critical because the free version of ChatGPT is based on GPT-3.5 and is probably the version most widely used by students and in schools.

Image: arthur.ai

Meta's Llama 2 and Claude 2 were particularly likely to refuse to answer questions about Moroccan politicians, presumably as a countermeasure against excessive hallucination. GPT-4 was the only model that gave more correct answers than hallucinated ones in this test.

Image: arthur.ai

GPT-4 is more cautious than other models

In a second test, the benchmarking platform looked at the extent to which models hedge their answers, that is, preface them with a caveat such as "As a large language model, I cannot …". This "hedging" can frustrate users and is sometimes found in AI-generated texts published by careless "authors".

For the hedging test, the platform used a dataset of generic questions that users might ask LLMs. The two GPT-4 models used hedging 3.3 and 2.9 percent of the time, respectively. GPT-3.5 Turbo and Claude 2 did so only about two percent of the time, while Cohere's Command model did not hedge at all.
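
Measuring hedging essentially comes down to counting how often a model's replies open with such a caveat. The following sketch uses a small list of trigger phrases chosen purely for illustration; Arthur's actual detection rules are not documented in this article.

```python
# Illustrative hedging phrases; not the benchmark's actual phrase list.
HEDGING_PHRASES = (
    "as a large language model",
    "as an ai language model",
    "i cannot",
)

def is_hedged(response: str) -> bool:
    """Treat a response as hedged if its opening contains a known caveat."""
    opening = response.strip().lower()[:80]
    return any(phrase in opening for phrase in HEDGING_PHRASES)

def hedging_rate(responses: list[str]) -> float:
    """Share of responses that start with a hedging caveat, in percent."""
    if not responses:
        return 0.0
    return 100 * sum(is_hedged(r) for r in responses) / len(responses)

print(hedging_rate([
    "As a large language model, I cannot give financial advice.",
    "Paris is the capital of France.",
]))  # -> 50.0
```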

GPT-4 models like to preface their answers with a hedging caveat. | Image: arthur.ai
Summary
  • Arthur, an LLM benchmarking service, compared the performance of large language models such as GPT-3.5, GPT-4, Llama 2, Claude 2, and Cohere's Command model on hallucinations and answer "hedging".
  • Claude 2 performed best on questions about U.S. presidents, with the fewest hallucinations and more correct answers than GPT-4, while GPT-4 was most accurate on questions about Moroccan politicians.
  • GPT-3.5, the model behind the free ChatGPT, hallucinated a lot. GPT-4 is more careful in choosing answers and hedges slightly more often than other LLMs.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.