
Google DeepMind has introduced FACTS Grounding, a new benchmark that tests AI models' ability to provide accurate, document-based answers.


The benchmark comprises 1,719 curated examples in which AI models must generate detailed responses based on provided documents. Its distinctive feature is the evaluation method: three leading AI models—Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet—serve as judges.

These models evaluate responses on two key criteria: whether the answer adequately addresses the query, and whether it's factually correct and fully supported by the source document.
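In practice this amounts to a two-step LLM-as-judge check: a response is first screened for whether it actually answers the request, and only then scored for grounding in the document. The Python sketch below illustrates the idea; the judge prompts, model identifiers, and the call_model helper are assumptions for illustration, not DeepMind's actual implementation.

```python
# Minimal sketch of the two-criteria judging step described above.
# Prompts, model IDs, and call_model are illustrative assumptions only.

JUDGE_MODELS = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]

ELIGIBILITY_PROMPT = (
    "Does the response adequately address the user request? Answer YES or NO.\n\n"
    "Request: {query}\n\nResponse: {response}"
)

GROUNDING_PROMPT = (
    "Is every claim in the response supported by the provided document? "
    "Answer YES or NO.\n\n"
    "Document: {document}\n\nRequest: {query}\n\nResponse: {response}"
)


def call_model(model: str, prompt: str) -> str:
    """Placeholder for an API call to the given judge model."""
    raise NotImplementedError


def judge_response(model: str, document: str, query: str, response: str) -> bool:
    """A response only counts if it addresses the query AND is fully grounded."""
    eligible = call_model(model, ELIGIBILITY_PROMPT.format(query=query, response=response))
    if eligible.strip().upper() != "YES":
        return False  # ineligible answers are not scored for grounding
    grounded = call_model(
        model, GROUNDING_PROMPT.format(document=document, query=query, response=response)
    )
    return grounded.strip().upper() == "YES"
```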

The test documents span various fields, including finance, technology, retail, medicine, and law. These documents can be up to 32,000 tokens (approximately 20,000 words) in length. The tasks include summaries, question-answering, and rephrasing exercises. Human evaluators created and verified these tasks to ensure they don't require creative responses, expert knowledge, or mathematical understanding.


To calculate final scores, the benchmark aggregates the verdicts of the judge models for each answer. The overall task score is the average of the judge models' results across all examples. Google DeepMind hosts a FACTS Leaderboard on Kaggle.
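Conceptually, this is a double average: over examples for each judge, then over judges. A minimal sketch of that aggregation, with illustrative data structures that are not taken from the benchmark code:

```python
# Each judge model produces a pass/fail factuality verdict per example.
# The per-judge pass rates are then averaged to get the task score.

def task_score(verdicts: dict[str, list[bool]]) -> float:
    """`verdicts` maps each judge model to its per-example verdicts."""
    per_judge = [sum(v) / len(v) for v in verdicts.values()]
    return sum(per_judge) / len(per_judge)


# Example: three judges, four examples each.
print(task_score({
    "gemini-1.5-pro": [True, True, False, True],   # 0.75
    "gpt-4o": [True, False, False, True],          # 0.50
    "claude-3.5-sonnet": [True, True, True, True], # 1.00
}))  # -> 0.75
```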

Table: Comparison of nine AI models on the FACTS Grounding benchmark, with accuracy values and confidence intervals. Google's Gemini models achieve the highest scores for factually accurate text generation. | Image: Google DeepMind

Preventing gaming of the system

Google DeepMind says it will continue developing the benchmark. "Factuality and grounding are among the key factors that will shape the future success and usefulness of LLMs and broader AI systems," the company writes.

To protect against manipulation, Google DeepMind split the benchmark into two parts: 860 public examples available now and 859 examples kept private. The final score combines results from both sets.
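Assuming the final score is simply the example-weighted average over both splits (an assumption; DeepMind does not spell out the exact weighting), the combination could look like this:

```python
# Hedged sketch: combine public and private split scores, weighting each
# split by its number of examples (860 public, 859 private).

def combined_score(public_score: float, private_score: float,
                   n_public: int = 860, n_private: int = 859) -> float:
    total = n_public + n_private
    return (public_score * n_public + private_score * n_private) / total


print(round(combined_score(0.82, 0.79), 4))  # ~0.805
```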

Google DeepMind acknowledges that while large language models are changing how people access information, controlling their factual accuracy remains an unsolved problem. Complex inputs can still lead to hallucinations, potentially undermining trust in LLMs and limiting their practical applications.

FACTS Grounding takes a different approach from other tests like OpenAI's SimpleQA. While SimpleQA tests models with 4,326 knowledge questions they must answer from training data, FACTS Grounding evaluates how well models process new information from provided documents.

Summary
  • Google DeepMind has launched FACTS Grounding, a new benchmark designed to assess the ability of AI models to provide accurate and comprehensive answers based on given text. The benchmark consists of 1,719 curated examples across a range of domains.
  • The benchmark uses a unique evaluation approach in which three leading AI models serve as judges, evaluating answers based on two key criteria: whether the question is adequately addressed, and whether the answer is factually accurate and fully grounded in the given document.
  • The FACTS Grounding Benchmark is intended to be a work in progress, with the goal of improving the reliability of language models and expanding their applicability to a wider range of use cases.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.