A new benchmark from Google Deepmind aims to measure AI model reliability more comprehensively than ever before. The results reveal that even top-tier models like Gemini 3 Pro and GPT-5.1 are far from perfect.
Researchers at Google Deepmind have introduced the FACTS Benchmark, a testing environment designed to evaluate the factual accuracy of large language models (LLMs) across multiple disciplines. The benchmark aggregates performance in four specific categories: visual understanding, internal knowledge, web search, and text-based evidence.
Deepmind argues that previous tests often evaluated isolated skills and failed to capture the bigger picture. A model might be excellent at summarizing documents, for example, but fail completely when retrieving facts from memory.
In the overall rankings, Google's own Gemini 3 Pro model took first place with a score of 68.8, followed by Gemini 2.5 Pro (62.1) and OpenAI's GPT-5 (61.8).
Kaggle hosts the leaderboard
To ensure the benchmark's integrity and long-term viability, the FACTS Leaderboard is hosted on the data science platform Kaggle. Developers can submit their models directly to the platform for automatic evaluation.
Kaggle divides the test data into public and private sets ("splits"). Only a portion of the prompts are visible to the public, while the rest remain secret to prevent developers from optimizing models specifically for the test questions. Kaggle handles the actual evaluation of all submitted models.
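The split-based setup works much like a held-out test set in any machine learning evaluation. As a rough illustration (not Kaggle's actual API; the function names and the idea of passing grading functions alongside prompts are assumptions for the sketch), a harness might score a model on both splits but only publish the public-split result:

```python
import random

def split_prompts(prompts, public_fraction=0.3, seed=42):
    """Divide a prompt set into a visible public split and a hidden private split."""
    shuffled = prompts[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * public_fraction)
    return shuffled[:cut], shuffled[cut:]  # (public, private)

def evaluate(model_answer_fn, prompt_set):
    """Score a model on a list of (prompt_text, grading_fn) pairs; returns the mean score."""
    scores = [grade(model_answer_fn(prompt)) for prompt, grade in prompt_set]
    return sum(scores) / len(scores)

# The leaderboard would display only the public score; the private score is
# computed server-side, so a model cannot be tuned to the published questions.
# public_prompts, private_prompts = split_prompts(all_prompts)
# print("public:", evaluate(my_model, public_prompts))
```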
Four pillars of truth
The benchmark is divided into four sub-tests that cover different usage scenarios:
- FACTS Multimodal: Models must answer questions about images. The test assesses whether the answer covers all essential facts ("coverage") and avoids contradicting the image or general world knowledge ("no-contradiction"); a simplified judging sketch follows after this list.
- FACTS Parametric: This test checks a model's internal knowledge ("closed book") without access to external tools. The questions rely on Wikipedia facts but use an "adversarial sampling" procedure to filter for questions that are difficult for simple models to solve.
- FACTS Search: This assesses the ability to generate correct answers using a search engine (the Brave Search API). This simulates information searches on topics absent from training data or those requiring specific details.
- FACTS Grounding (v2): Building on its predecessor, this test measures how well a model generates answers based solely on a provided long document, without adding external information.
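To make criteria such as "coverage" and "no-contradiction" more concrete, here is a minimal, hypothetical sketch of an LLM-as-judge check along those lines. The prompt wording and the `call_judge_model` helper are assumptions, not Deepmind's actual grading setup:

```python
import json

JUDGE_TEMPLATE = """You are grading a model's answer for factual quality.

Source material:
{source}

Question:
{question}

Candidate answer:
{answer}

Rate the answer on two criteria:
1. coverage: does it include all essential facts needed to answer the question?
2. no_contradiction: does it avoid contradicting the source material or well-established world knowledge?

Reply with JSON: {{"coverage": 0 or 1, "no_contradiction": 0 or 1}}"""

def judge_answer(call_judge_model, source, question, answer):
    """Ask a judge model to grade one answer; returns 1.0 only if both criteria pass."""
    prompt = JUDGE_TEMPLATE.format(source=source, question=question, answer=answer)
    verdict = json.loads(call_judge_model(prompt))  # call_judge_model stands in for any LLM API call
    return float(verdict["coverage"] and verdict["no_contradiction"])
```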
The results show clear discrepancies between disciplines. While Gemini 3 Pro dominates in "Search" (83.8 percent) and "Parametric" (76.4 percent), it drops to 46.1 percent in the "Multimodal" category. GPT-5 shows similar fluctuations: it is strong in search (77.7 percent) but significantly weaker in internal fact retrieval (55.8 percent). In our experience, however, GPT-5.1 with Thinking currently performs best for complex search queries.
In addition to the public and private splits described above, the scoring itself relies on AI judges: other models, primarily Gemini 2.5 Flash and GPT-5, evaluate the answers automatically. To avoid bias toward any single judge, the final rating is the average score across several judge models.
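As an illustration of that averaging step (the judge names and scores below are placeholders, not figures from the paper), the final per-answer rating could simply be the mean across judge models:

```python
def aggregate_judges(per_judge_scores: dict[str, float]) -> float:
    """Average the scores from several judge models to soften any single judge's bias."""
    return sum(per_judge_scores.values()) / len(per_judge_scores)

# Hypothetical example values:
print(aggregate_judges({"gemini-2.5-flash": 1.0, "gpt-5": 0.5}))  # -> 0.75
```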
Strategic silence beats hallucination
One interesting detail from the study is how models handle uncertainty in the parametric test. The researchers distinguish between pure accuracy and "attempted accuracy."
The GPT-5 model has a hedging rate (refusal rate) of 13.3 percent, meaning it frequently declines to answer questions it is unsure about. By contrast, OpenAI's o3 almost always answers (only 1.9 percent refusal) but is hardly more accurate in absolute terms. Thanks to this strategic silence, GPT-5 achieves a higher attempted accuracy (64.3 percent) than o3 (58.2 percent).
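The relationship between the two metrics is easiest to see with a worked example. Assuming the straightforward definitions (overall accuracy counts refusals as misses, while attempted accuracy only scores the questions a model actually answers), and using invented counts that merely mirror the reported rates:

```python
def accuracy_metrics(correct: int, attempted: int, total: int):
    """Return (overall accuracy, attempted accuracy, hedging rate) for one model."""
    overall = correct / total             # refusals count against the model
    attempted_acc = correct / attempted   # only answered questions are scored
    hedging_rate = 1 - attempted / total  # share of questions the model declined
    return overall, attempted_acc, hedging_rate

# Invented example: out of 1,000 questions, a cautious model answers 867
# (a 13.3 percent hedging rate) and gets 557 of those right.
print(accuracy_metrics(correct=557, attempted=867, total=1000))
# -> (0.557, ~0.642, 0.133)
```

A model that skips the questions it would likely get wrong therefore scores better on attempted accuracy even if its raw hit rate across all questions is similar.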
The recently published "Omniscience Index," a similar benchmark for AI reliability, shows that this distinction matters in the real world. Gemini 3 Pro also took first place there, but the data revealed a critical flaw: when the model couldn't provide an answer, it hallucinated one 88 percent of the time instead of admitting ignorance.
Because the Omniscience benchmark penalized incorrect answers severely, only four out of 40 models achieved a positive score. Deepmind's results confirm Gemini 3 Pro's lead, but the closer look at hedging shows that models like GPT-5 sometimes act more cautiously than their competitors. Both analyses conclude that massive model size and general intelligence do not automatically protect against factual errors.