AI search agents often confirm what they already know instead of actually researching the web

Image: Nano Banana Pro prompted by THE DECODER

A new study suggests that leading AI search agents do little actual research on established benchmarks; they mostly use the web to confirm answers they already have. Once models have to go beyond their existing knowledge, search performance falls apart.

Frontier models like GPT-5.4, Gemini 3.1 Pro, Claude Sonnet 4.6, DeepSeek-V4-Pro, and Kimi-K2.6 keep posting higher scores on BrowseComp. The benchmark asks agents complex questions that can only be answered through multi-step browsing and piecing together information from different web sources.

Researchers from the Harbin Institute of Technology and Xiaohongshu now show in a study that these results say less about the agents' research skills than assumed. The authors attribute the high scores to "intrinsic knowledge dependence" (IKD), a reliance on internal knowledge the models absorbed during training.

Split diagram comparing static benchmarks and LiveBrowseComp. Static benchmarks show knowledge migrating into model parameters over generations, while LiveBrowseComp continuously refreshes its knowledge base.
With static benchmarks, the needed knowledge migrates into parameter memory over model generations, making tasks easier over time. LiveBrowseComp counters this with time-bound questions. | Image: Fan et al.

The researchers tested eleven models in total, first stripping away all search and browsing tools. Even without internet access, the models scored surprisingly high. MiniMax M2.5 solved 44.5 percent of BrowseComp tasks from memory alone. Kimi K2.6 hit 62 percent on the Chinese BrowseComp-ZH variant. A big chunk of benchmark performance, in other words, is already in place before any search happens.

Two heatmaps comparing six models across BrowseComp, BrowseComp-ZH, HLE, and GAIA benchmarks. Left side shows closed-book accuracy without tools, right side shows the additional gain from web search.
Even without tools, models score high: MiniMax M2.5 reaches 44.5 percent on BrowseComp. The actual contribution of web search is often small. | Image: Fan et al.

Searching can actually hurt the answer

The second test is more telling. The researchers left the search interface in place but removed all answer-supporting documents from the search index. Every model tested then performed worse than it did without any tool access at all. MiniMax M2.5 dropped from 44.5 to 8.0 percent. Kimi K2.6 fell from 25.5 to 2.3 percent. Once no confirming hits show up, searching actively pulls agents away from answers they would have gotten right from memory.
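The ablation described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the `Document` type and the `ablate_index` and `point_drop` helpers are hypothetical names, and the sketch assumes the "from" figures in the article are the closed-book scores. The accuracy values themselves are the ones reported above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical search-index entry; not a type from the study."""
    doc_id: str
    text: str


def ablate_index(index, supporting_ids):
    """Return a copy of the index without the answer-supporting documents."""
    blocked = set(supporting_ids)
    return [doc for doc in index if doc.doc_id not in blocked]


def point_drop(closed_book, ablated_search):
    """Percentage points lost when moving from no tools to the ablated index."""
    return round(closed_book - ablated_search, 1)


# Accuracy in percent as reported in the article:
# (closed-book, search with supporting documents removed).
reported = {
    "MiniMax M2.5": (44.5, 8.0),
    "Kimi K2.6": (25.5, 2.3),
}

drops = {model: point_drop(cb, ab) for model, (cb, ab) in reported.items()}

for model, drop in drops.items():
    print(f"{model}: {drop} points lost")
```

Under that reading, MiniMax M2.5 loses 36.5 points and Kimi K2.6 loses 23.2 points simply because a search box that returns nothing useful is available.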

Two charts showing search behavior. Left: the share of model-originated queries rises to 70-80 percent as search progresses. Right: actual evidence use rate across four models sits between just 24.7 and 32.2 percent.
The further the search progresses, the more agents hunt for their own hypotheses instead of new facts. When they do find supporting sources, they use them less than a third of the time. | Image: Fan et al.

An analysis of the search paths explains why. More than half of all queries come from the model's own reasoning rather than from previously found hits. Even when relevant evidence does appear in search results, the agents fold it into their reasoning less than a third of the time. The loop is model-led, not evidence-led.
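To make the distinction between model-led and evidence-led queries concrete, here is a rough sketch of one way such a share could be estimated. The heuristic is an assumption made for illustration and is not the method used in the study: a query counts as evidence-led if it introduces a term that appeared in earlier search results, and as model-originated otherwise. The function names and the four-character token cutoff are likewise hypothetical.

```python
import re


def tokens(text):
    """Lowercased word tokens longer than three characters (crude stopword filter)."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) > 3}


def model_originated_share(question, steps):
    """Estimate the fraction of queries not traceable to earlier search results.

    `steps` is an ordered list of (query, results_text) pairs. A query is
    treated as evidence-led when at least one of its new terms (terms not in
    the question or in earlier queries) already appeared in prior results.
    """
    if not steps:
        return 0.0
    known = tokens(question)   # terms the agent has already used
    seen_evidence = set()      # terms observed in earlier results
    model_led = 0
    for query, results in steps:
        new_terms = tokens(query) - known
        if not (new_terms & seen_evidence):
            model_led += 1
        seen_evidence |= tokens(results)
        known |= tokens(query)
    return model_led / len(steps)
```

A real analysis would need something sturdier than token overlap, such as entity matching or human annotation, but the sketch shows what "model-led" means operationally: the query's new content cannot be traced to anything the agent has retrieved so far.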

A benchmark beyond the knowledge frontier

To measure real search behavior, the authors built LiveBrowseComp. The benchmark contains 335 human-written questions, each depending on at least one fact from the 90 days before creation and impossible to answer without that current information.

The underlying events come from constantly updated sources like film databases, game directories, security vulnerability registers, and earthquake catalogs. Globally prominent events are filtered out deliberately, leaving obscure but publicly verifiable facts that had little chance of seeping into model parameters during training.

Flowchart of the LiveBrowseComp construction pipeline, from data sources through temporal filtering, longtail scoring, answer stability checks, and question construction to expert review and final benchmark tasks.
The pipeline filters only facts from the last 90 days, discards unstable answers, and has each question checked by experts for timeliness, difficulty, and clarity. | Image: Fan et al.
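The temporal filter is the simplest stage of this pipeline to sketch. The snippet below is a minimal illustration, not the authors' implementation: the function names are hypothetical, and treating the 90-day boundary as inclusive is an assumption, since the article does not say how the edges of the window are handled.

```python
from datetime import date, timedelta

WINDOW_DAYS = 90  # window length stated in the article


def within_window(fact_date, created_on, window_days=WINDOW_DAYS):
    """True if the fact dates from the `window_days` before question creation.

    Assumption: both ends of the window are inclusive, and facts dated after
    the creation date are rejected.
    """
    age = created_on - fact_date
    return timedelta(0) <= age <= timedelta(days=window_days)


def is_eligible(fact_dates, created_on):
    """A question qualifies if at least one of its facts falls inside the window."""
    return any(within_window(d, created_on) for d in fact_dates)
```

For example, a question created on January 31 could rest on a fact from November 2 of the previous year (exactly 90 days earlier) but not on one from November 1. The later stages named in the figure, such as longtail scoring and answer stability checks, are not covered by this sketch.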

Human testers need about the same amount of time for LiveBrowseComp as for BrowseComp and solve a similar number of tasks. The performance drop among models therefore stems from losing the memory shortcut, not from harder questions.

Leaderboard rankings fall apart

On LiveBrowseComp, all models in the closed-book test fall below two percent accuracy. With tools turned on, scores land about 25 to 40 percentage points below the same models' BrowseComp results.

Bar chart comparing closed-book accuracy on BrowseComp versus LiveBrowseComp for nine models. On BrowseComp, scores range from 11 to 44.5 percent. On LiveBrowseComp, all models drop below 2 percent.
Without tools, models solve up to 44.5 percent of BrowseComp questions from memory. On LiveBrowseComp, that number drops below two percent across the board, confirming the temporal block against parameter knowledge. | Image: Fan et al.

This shifts the rankings. GLM 5.1 leads clearly among open-source models on BrowseComp but falls to mid-pack on LiveBrowseComp. DeepSeek v3.2 sits at the bottom on BrowseComp but climbs to the top on LiveBrowseComp, passing several models that previously outperformed it. A model's spot on a static leaderboard, in other words, mostly reflects how much it already knows, not how well it searches.

Agents need more steps when they can't rely on memory

On BrowseComp, agents solve many questions in very few steps, a sign of quick memory confirmation. On LiveBrowseComp, that pattern disappears. The step counts shift much higher, which suggests the agents are doing real research instead of recalling stored knowledge.

Six histograms showing search turns per question for Kimi K2.6, MiniMax M2.5, and GLM 5.1. On BrowseComp, solutions cluster at very few turns. On LiveBrowseComp, the distribution shifts toward much higher turn counts.
On BrowseComp, agents solve many questions in just a few steps, a pattern of quick memory confirmation. On LiveBrowseComp, that cluster disappears, and searches take far more rounds. | Image: Fan et al.

The authors argue that dynamic, time-sensitive benchmarks should become the standard for evaluating AI agents. They also want training signals that reward evidence-based research over the typical guess-and-verify approach.

Other studies have flagged similar problems. A benchmark from Peking University found that top models often produce the right answer when analyzing documents but cite the wrong source, what the researchers call "attribution hallucination." A tool called CiteAudit recently discovered that fabricated references have already made it into accepted papers at major AI conferences. The reason: commercial models don't reliably catch made-up citations.
