A new benchmark from Artificial Analysis reveals alarming weaknesses in the factual reliability of large language models. Out of 40 models tested, only four achieved a positive score - with Google's Gemini 3 Pro clearly in the lead.

Gemini 3 Pro scored 13 points on the new Omniscience Index, which ranges from -100 to 100; a score of 0 means a model answers as many questions incorrectly as correctly. That puts it substantially ahead of Claude 4.1 Opus (4.8), GPT-5.1, and Grok 4. The high score mainly reflects the model's strong accuracy: Gemini 3 Pro outperformed Grok 4, the previously most accurate model, by 14 points. The AA-Omniscience benchmark measures how reliably AI models retrieve factual knowledge across different subject areas.

Gemini 3 Pro leads by a clear margin, followed by Claude 4.1 Opus, GPT-5.1, and Grok 4. All other models tested land in negative territory, and even the frontrunners do not perform particularly well.

According to Artificial Analysis, Gemini 3 Pro’s lead is mainly driven by its increased accuracy - 14 points higher than Grok 4, the prior record-holder. The researchers interpret this as evidence of the model's large scale since accuracy in the benchmark strongly correlates with model size.

Google's Gemini 3 Pro shows a significant increase in accuracy compared to Grok 4 and its direct predecessor.

Hallucination rates remain the main weakness

The study found that poor results across the board stem largely from high hallucination rates. Gemini 3 Pro achieved the highest overall accuracy at 53 percent, far ahead of previous leaders like GPT‑5.1 (high) and Grok 4, both at 39 percent. But the model still showed an 88 percent hallucination rate, matching Gemini 2.5 Pro and Gemini 2.5 Flash.

GPT‑5.1 (high) and Grok 4 also showed high hallucination rates, at 81 and 64 percent respectively, but Gemini 3 Pro went even further. Artificial Analysis concluded that while Gemini 3 Pro demonstrates greater factual coverage, its tendency to give wrong answers rather than admit uncertainty remains unchanged.

Here, hallucination rate refers to the share of questions a model answers incorrectly rather than declining to answer, out of all the questions it fails to answer correctly - a high value therefore indicates overconfidence, not a lack of knowledge.
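
As a rough illustration (the article does not spell out the exact formula), this reading of the hallucination rate can be computed from a model's answer counts along these lines; the counts below are hypothetical:

```python
def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Share of wrongly answered questions among all questions the model
    failed to answer correctly (wrong answers plus abstentions)."""
    missed = incorrect + abstained
    return incorrect / missed if missed else 0.0

# Hypothetical counts: 440 wrong answers, 60 abstentions -> 0.88 (88 percent),
# i.e. the model almost always guesses instead of admitting uncertainty.
print(hallucination_rate(incorrect=440, abstained=60))
```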

Claude 4.1 Opus scored 36 percent accuracy with one of the lowest hallucination rates, giving it the top position before Gemini 3 Pro’s release.

Hallucination rates vary widely among the tested LLMs.

The AA-Omniscience benchmark covers 6,000 questions across 42 economically relevant topics in six domains: business, humanities and social sciences, health, law, software engineering, and science and math. The dataset draws from authoritative academic and industrial sources and was automatically generated by an AI agent.

The 6,000 questions are spread across six domains and 42 categories; the treemap shows examples such as medicine with 700 questions and engineering with 400.

A new scoring system that penalizes guessing

Unlike typical benchmarks, the Omniscience Index penalizes wrong answers as much as it rewards correct ones. The researchers argue that current evaluation methods often encourage guessing, which increases hallucination behavior.

In contrast, the new metric rewards restraint. Models receive no points for admitting uncertainty, but they also aren’t penalized. Wrong answers, however, lead to large deductions.
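
A minimal sketch of such a scoring rule, assuming the index is simply correct answers minus wrong answers over all questions, scaled to the -100-to-100 range mentioned above (the benchmark's actual weighting may differ):

```python
def omniscience_index(correct: int, incorrect: int, abstained: int) -> float:
    """Correct answers add points, wrong answers subtract them equally,
    and abstentions are neutral; the result falls between -100 and 100."""
    total = correct + incorrect + abstained
    return 100 * (correct - incorrect) / total if total else 0.0

# Hypothetical counts: guessing wrong nearly as often as answering correctly
# nets a low score, while abstaining instead of guessing avoids the penalty.
print(omniscience_index(correct=500, incorrect=440, abstained=60))   # 6.0
print(omniscience_index(correct=500, incorrect=300, abstained=200))  # 20.0
```

Under this reading, a model that declines to answer when unsure can outscore a more knowledgeable but overconfident one.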

Side-by-side chat interface: the model with little knowledge defines MCP for Supabase incorrectly, while the model with high knowledge correctly explains MCP as the Model Context Protocol.
The Model Context Protocol (MCP) was only introduced by Anthropic at the end of 2024 and is therefore probably rarely found in the training material of large language models.

The results group models into four categories: those with extensive knowledge and high reliability (like Claude 4.1 Opus), those with knowledge but low reliability (like Claude 4.5 Haiku), those with limited knowledge but consistent reliability (like GPT‑5.1), and finally, smaller models lacking both knowledge and reliability, such as OpenAI’s lightweight gpt‑oss.

No domain-specific breakdown was available for Gemini 3 Pro.

Scatter plot of Omniscience accuracy versus Omniscience Index; models in the green quadrant (e.g., Claude 4.1 Opus) offer both high accuracy and high reliability.
The smallest version of OpenAI's open-weight release gpt-oss fails on both dimensions.

Older Llama model performs surprisingly well

General intelligence doesn’t necessarily translate into factual reliability. Models like Minimax M2 and gpt‑oss‑120b (high) perform strongly on the broader Artificial Analysis Intelligence Index, which aggregates results from multiple benchmarks, but do poorly on the Omniscience Index due to high hallucination rates.

Conversely, the older Llama‑3.1‑405B scored well on the Omniscience Index even though it typically ranks below newer frontier models in overall evaluations.

No single model demonstrated consistently strong factual reliability across all six domains. Claude 4.1 Opus led in law, software engineering, and the humanities; GPT‑5.1 ranked first in business questions; and Grok 4 performed best in health and science.

Heatmap of normalized Omniscience Index values across six domains and 24 models (green = best, red = worst).
Of the major commercial models, Google appears to perform the worst across all categories.

According to the study, these domain differences mean that relying solely on overall performance can obscure important gaps.

Bigger doesn’t always mean more reliable

While larger models tend to achieve higher accuracy, they don’t necessarily have lower hallucination rates. Several smaller models - like Nvidia’s Nemotron Nano 9B V2 and Llama Nemotron Super 49B v1.5 - outperformed much larger competitors on the Omniscience Index.

Artificial Analysis confirmed that accuracy strongly correlates with model size, but hallucination rate does not. That explains why Gemini 3 Pro, despite its high accuracy, still hallucinates frequently.

In terms of cost efficiency, Claude 4.5 Haiku stands out with a higher Omniscience score than several far more expensive models like GPT‑5.1 (high) and Kimi K2 Thinking.

The researchers have released 10 percent of the benchmark’s questions as a public dataset to support future research, while the majority remains private to prevent contamination of training data.

A related recent study uncovered structural flaws in existing AI benchmarks, citing vague definitions of key terms like "reasoning," unrepresentative sampling, and a lack of statistical validation across model comparisons.

Summary
  • A new benchmark from Artificial Analysis found that most large language models perform poorly in factual reliability, with only four out of 40 achieving a positive score; Google’s Gemini 3 Pro leads with 13 points, well ahead of Claude 4.1 Opus, GPT‑5.1, and Grok 4.
  • Despite its top accuracy of 53 percent, Gemini 3 Pro showed a high 88 percent hallucination rate, meaning it often produces false answers with confidence instead of admitting uncertainty—an issue shared by other major models like GPT‑5.1 and Grok 4.
  • The Omniscience Index penalizes incorrect answers rather than rewarding guesses and reveals that while model size correlates with accuracy, it does not reduce hallucinations; smaller models such as Nvidia’s Nemotron Nano 9B V2 and Llama Nemotron Super 49B v1.5 outperformed many larger systems.