
Researchers at the University of California, Berkeley, and Google DeepMind have developed a method showing that AI language models with access to a search engine can fact-check long-form answers more accurately than human annotators.

The researchers used Google DeepMind's Search-Augmented Factuality Evaluator (SAFE) tool to assess the factual accuracy of long-form model responses. SAFE uses an AI agent to break a text response down into individual facts, check each one for relevance, and verify the relevant facts through Google searches, allowing it to rate the accuracy of every factual claim.

The LLM first breaks down the long answer into individual, self-contained facts. For each fact, the LLM determines whether it is relevant to answering the question and performs a Google search to verify it. | Image: Wei et al.
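
In outline, SAFE is a loop over atomic facts: split the answer, filter for relevance, then verify each fact against search results. The Python sketch below illustrates that loop under stated assumptions; the `llm` and `search` callables and the `FactRating` container are names invented for this illustration, not the authors' published implementation.

```python
# Minimal sketch of a SAFE-style pipeline. `llm` and `search` are assumed
# interfaces (an instruction-following model and a web search client),
# not part of the published SAFE code.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FactRating:
    fact: str
    relevant: bool
    supported: Optional[bool]  # None if the fact was judged irrelevant


def rate_long_answer(
    question: str,
    answer: str,
    llm: Callable[[str], str],
    search: Callable[[str], List[str]],
) -> List[FactRating]:
    """Break an answer into facts, filter for relevance, verify via search."""
    # Step 1: split the long answer into individual, self-contained facts.
    facts = [
        line.strip()
        for line in llm(
            f"List each self-contained fact in this answer, one per line:\n{answer}"
        ).splitlines()
        if line.strip()
    ]

    ratings: List[FactRating] = []
    for fact in facts:
        # Step 2: keep only facts that are relevant to the question.
        relevance = llm(
            f"Question: {question}\nFact: {fact}\n"
            "Is this fact relevant to answering the question? Answer yes or no."
        )
        if "yes" not in relevance.lower():
            ratings.append(FactRating(fact, relevant=False, supported=None))
            continue

        # Step 3: verify the fact against Google-style search results.
        snippets = search(fact)
        verdict = llm(
            f"Fact: {fact}\nSearch results:\n" + "\n".join(snippets)
            + "\nIs the fact supported by the results? Answer supported or not supported."
        )
        ratings.append(
            FactRating(fact, relevant=True, supported="not supported" not in verdict.lower())
        )
    return ratings
```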

For the study, GPT-4 generated the publicly available "LongFact" dataset, which contains 2,280 questions on 38 topics and serves as the basis for evaluating the factual accuracy of long answers provided by large language models (LLMs).

A potential weakness of the system is that LongFact and SAFE depend on the capabilities of the language models used. If these models have weaknesses in following instructions or reasoning, this affects the quality of the generated questions and ratings. In addition, fact-checking depends on the capabilities and coverage of Google Search.


Language models with internet access hallucinate less than humans

The researchers compared SAFE's ratings for 16,011 individual facts with the ratings of human annotators from an earlier dataset.

They found that SAFE gave the same rating as the human annotators for 72 percent of the facts, suggesting comparable performance in most cases. In a sample of 100 instances where SAFE and the humans disagreed, SAFE's assessment was correct 76 percent of the time, while the human annotators were right in only 19 percent of cases, making SAFE roughly four times more accurate in these disputed cases.

SAFE and humans agree about three-quarters of the time. When they disagree, however, SAFE is far more likely to be correct than the humans (76% vs. 19%); in about five of the 100 disputed cases, both were wrong. | Image: Wei et al.

According to the researchers, when the AI model fails, it is primarily due to reasoning errors; since only GPT-3.5 was used as the rating model, there is still plenty of room for improvement.

In addition to being more accurate, SAFE was also more than 20 times cheaper than human annotators ($0.19 per answer vs. $4 per answer). The researchers attribute the AI's advantage to its ability to systematically retrieve and analyze vast amounts of information from the web, while humans often rely on memory or subjective judgment, which leads to more hallucinations.

According to the researchers, the SAFE system is better than humans at drawing logical conclusions from facts thanks to Google searches. | Image: Wei et al.

The study examined 13 language models from four model families (Gemini, GPT, Claude, and PaLM-2), with larger models generally achieving better factual accuracy in long responses. GPT-4-Turbo, Gemini-Ultra, and PaLM-2-L-IT-RLHF performed best.


The newly introduced "F1@K" metric balances the accuracy of an answer (the share of its facts that are supported) against its comprehensiveness (how many supported facts it contains relative to a target length K), allowing for a standardized comparison of different language models.
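
Based on the paper's description, F1@K can be read as the harmonic mean of factual precision and a length-aware recall term. The snippet below is an illustrative sketch of that reading; the function name and example numbers are ours, and the reference definition lives in the paper and its GitHub repository.

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Illustrative F1@K: harmonic mean of factual precision and
    length-aware recall, where K is the desired number of supported facts."""
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# Example: an answer with 45 supported and 5 unsupported facts, scored at K = 64.
print(round(f1_at_k(45, 5, 64), 3))  # precision 0.9, recall ~0.70 -> ~0.789
```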

The performance of the language models sorted by factual accuracy. | Image: Wei et al.

The results have significant implications for the use of LLMs in real-world applications, particularly where factual accuracy is crucial. They demonstrate that LLMs with Internet access can be an effective tool for automated fact-checking.

The research findings could also indicate how Google aims to enhance the reliability of generative text AI, especially in the context of search. The Google check button in the Gemini chatbot, which backs up statements generated by the LLM with internet sources, is already a step in this direction. SAFE goes a step further by integrating reasoning into the verification process.

Image: Screenshot Google Gemini Advanced

The research may also explain OpenAI's interest in more closely integrating LLM and internet search into a single product. The complete results and code are available on GitHub.

Summary
  • Researchers at the University of California, Berkeley, and Google DeepMind have developed a method showing that AI language models with access to search engines can fact-check long-form answers more accurately than human annotators.
  • They used a tool they developed, Search-Augmented Factuality Evaluator (SAFE), which breaks down answers into individual facts, checks them for relevance, and verifies them with Google searches.
  • SAFE agreed with the human annotators' ratings 72 percent of the time. In the case of conflicting ratings, SAFE was correct 76 percent of the time, while the humans were correct only 19 percent of the time. In addition, SAFE was 20 times less expensive than human annotators.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.