A new study from Oppo's AI team reveals systematic flaws in "deep research" systems designed to automate complex reporting. Nearly 20 percent of errors stem from systems inventing plausible-sounding but entirely fake content.

The researchers analyzed around 1,000 reports using two new evaluation tools: FINDER, a benchmark for deep research tasks, and DEFT, a taxonomy for classifying failures.

To feign competence, one system claimed an investment fund achieved an exact 30.2 percent annual return over 20 years. Since such specific data isn't public, the AI likely fabricated the figure.

In another test involving scientific papers, a system listed 24 references. A check revealed several links were dead, while others pointed to reviews rather than original research—yet the system insisted it had verified every source.

The team identified 14 error types across three categories: reasoning, retrieval, and generation. Generation issues topped the list at 39 percent, followed by retrieval failures at 33 percent and reasoning errors at 28 percent.

The DEFT taxonomy breaks errors down into three main categories and 14 error types: four for reasoning, five for retrieval, and five for generation. Reasoning covers issues like rigid planning, while retrieval and generation track problems with verification and content fabrication, with strategic content fabrication the single most common error type. | Image: Oppo AI Agent Team

Systems fail to adapt when plans go wrong

Most systems understand the assignment; the failure happens during execution. If a system plans to analyze a database but gets locked out, it doesn't change strategy. Instead, it simply fills the missing sections with hallucinated content.

The study maps five error types onto the information-seeking pipeline, which runs from seeking through understanding to usage, integration, and presentation, with a verification step along the way: verification mechanism failure (VMF), insufficient information acquisition (IIA), information handling deficiency (IHD), information integration failure (IIF), and information representation misalignment (IRM). Errors can creep in at any stage, and without a final verification step, systems often end up presenting baseless claims as fact. | Image: Oppo AI Agent Team

Researchers describe this as a lack of "reasoning resilience"—the ability to adapt when things go wrong. In real-world scenarios, this flexibility matters more than raw analytical power.

To test this, the team built the FINDER benchmark, featuring 100 complex tasks that require hard evidence and strict methodology.

Leading models struggle to pass the benchmark

The study tested commercial tools like Gemini 2.5 Pro Deep Research and OpenAI's o3 Deep Research against open-source alternatives. Gemini 2.5 Pro took the top spot but only scored 51 out of 100 points. OpenAI's o3 stood out for factual accuracy, getting nearly 66 percent of its citations right.

The benchmark scores 13 research agents, spanning proprietary APIs, open-source models, and agent frameworks, on report quality (RACE) and factual accuracy (FACT). Gemini 2.5 Pro Deep Research leads with an overall RACE score of 50.95, while OpenAI's o3 Deep Research tops citation accuracy at 65.98 percent; checklist pass rates range from 44.87 to 72.19 percent. | Image: Oppo AI Agent Team

According to the study, systems don't fail because they are confused by the prompt, but because they struggle to integrate evidence or handle uncertainty. Rather than hiding gaps with fake details, these agents need transparent ways to admit what they don't know.

The researchers have released the FINDER and DEFT frameworks on GitHub to help the community build more reliable agents.

The timing is relevant. Since late 2024, Google, Perplexity, Grok, and OpenAI have all rolled out "deep research" features that promise comprehensive reports in minutes, often scraping hundreds of websites at once. But as the study shows, simply throwing more data at the problem doesn't guarantee better results and might actually multiply errors.

The industry is well aware of these limitations. OpenAI recently admitted that LLM-based systems like ChatGPT will likely never stop fabricating information entirely. To address this, the company is working on features that allow the system to indicate its certainty level. It is also experimenting with "confessions", a mechanism where the system generates a separate follow-up note admitting if it made something up or was unsure.

Summary
  • A study finds that deep research systems frequently generate convincing but false information, with such fabrications making up nearly 20 percent of all mistakes.
  • The main issues stem from inflexible workflows, insufficient verification, and poor adaptability, which together cause a high rate of errors during the research process.
  • Even top-performing systems like Gemini 2.5 Pro score only 51 out of 100 in testing, mainly due to their lack of flexible approaches and thorough checking routines.