
Google's AI Overviews are correct nine out of ten times, study finds

Image: Nano Banana Pro prompted by THE DECODER

Key Points

  • The AI startup Oumi analyzed 4,326 Google searches on behalf of the New York Times and found that Google's AI Overviews answered correctly 85% of the time with Gemini 2 and 91% with Gemini 3.
  • At Google's scale, even a nine percent error rate translates to millions of wrong answers per hour. Google says the study has "serious holes."
  • Despite the improved accuracy, verifiability has gotten worse: with Gemini 3, 56 percent of correct answers could not be verified through the linked sources, up from 37 percent with Gemini 2.

Google puts a disclaimer under every AI-generated search response: "AI responses may include mistakes." But just how often those mistakes actually happen has remained largely unstudied.

On behalf of the New York Times, AI startup Oumi examined 4,326 Google searches using the industry-standard SimpleQA benchmark. The tests ran in two rounds: once in October with Gemini 2 powering the AI, and again in February after the upgrade to Gemini 3.

The findings: with Gemini 2, AI Overviews were correct 85 percent of the time. With Gemini 3, that number climbed to 91 percent. That sounds impressive, but at Google's scale, it still means millions of wrong answers every hour.
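The "millions per hour" claim follows from simple arithmetic. As a rough sketch, assuming the commonly cited estimate of about 8.5 billion Google searches per day and an assumed share of searches that trigger an AI Overview (neither figure comes from the study):

```python
# Back-of-envelope estimate of wrong AI Overview answers per hour.
# Both inputs below are assumptions, not figures from the Oumi study:
SEARCHES_PER_DAY = 8.5e9    # commonly cited estimate of daily Google searches
OVERVIEW_SHARE = 0.20       # assumed fraction of searches showing an AI Overview
ERROR_RATE = 0.09           # 91% correct with Gemini 3, per the study

overviews_per_hour = SEARCHES_PER_DAY * OVERVIEW_SHARE / 24
wrong_per_hour = overviews_per_hour * ERROR_RATE
print(f"~{wrong_per_hour / 1e6:.1f} million wrong answers per hour")
```

Even if the assumed share of searches with an AI Overview is off by a factor of two in either direction, the result stays in the millions per hour.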

What the study doesn't address is whether users would have gotten better answers through traditional search results or other sources. Not everything on websites is automatically correct either. The real question is whether users end up with more correct information overall than they would without Google's AI Overviews.


Accuracy is up, but verifiability is down

Another key finding: while accuracy improved with Gemini 3, verifiability actually got worse. Oumi checked whether the sources Google linked actually supported the answers it gave. With Gemini 2, 37 percent of correct answers were "ungrounded," meaning the linked websites didn't fully back up the information. With Gemini 3, that figure jumped to 56 percent. Often, there's simply no way to verify an answer based on the source Google provides.

The quality of those sources is questionable too. Out of 5,380 sources Google cited, Facebook and Reddit ranked second and fourth most common. Facebook showed up as a source in five percent of correct answers and seven percent of incorrect ones. Google may have an incentive to favor sources that are less likely to sue over content use.

The New York Times highlights several examples of how things can go wrong even when the system locates the right source. In a question about the Classical Music Hall of Fame, Google identified the correct website listing Yo-Yo Ma as a member but still claimed there was no record of his induction.

When asked about the river west of Goldsboro, North Carolina, Google found the right tourism website but misread the information, naming the Neuse River instead of the actual Little River to the west.


And for a question about the Bob Marley Museum, Google's AI Overview gave the wrong opening year—1987 instead of 1986—pulling from a Facebook post, a travel blog, and a Wikipedia page with conflicting information.

Google pushes back on the study's methods

To verify answers at scale, Oumi used its own AI verification model, HallOumi. That's the only practical way to check thousands of responses, but it comes with an obvious weakness: the AI doing the checking can make mistakes too. Moreover, AI Overviews can generate different answers for identical searches, even when queries are just seconds apart.

Google spokesperson Ned Adriance called the study flawed, saying it has "serious holes." The SimpleQA benchmark itself contains incorrect information and doesn't reflect what people actually search for on Google, he said.

Despite its name, SimpleQA, developed by OpenAI, is built around particularly tricky questions, ones where at least one AI model failed during a pre-screening process. That means the failure rate is naturally higher. The benchmark is also designed for scenarios without internet access.

In the Artificial Analysis Intelligence Index, Google's latest model, Gemini 3.1 Pro, shows a hallucination rate 38 percentage points lower than the earlier Gemini 3, which was likely running as a less capable Flash variant in Google's search at the time of testing. Google says results with web search are more accurate than those based purely on model knowledge.

The real issue is what AI answers are doing to the open web

The bigger debate around Google's AI Overviews is about what they're doing to the internet. By serving up direct answers instead of sending users to external websites, Google is cutting off traffic to publishers and undermining their economic foundation.

The open web is losing its role as a freely linked information network, increasingly replaced by a centralized AI interface under Google's control. A 90 percent accuracy rate is likely good enough that most users, on most searches, skip clicking through to the underlying website altogether.

Google has consistently denied studies showing that AI Overviews hurt web traffic, while sharing no numbers of its own. Even OpenAI was more upfront when it first launched web features for ChatGPT, stating that "we appreciate that this is a new method of interacting with the web, and welcome feedback on additional ways to drive traffic back to sources and add to the overall health of the ecosystem," though that concern quietly faded as its search rollout progressed.
