OpenSeeker's open-source approach aims to break up the data monopoly for AI search agents

Mar 23, 2026

Nano Banana Pro prompted by THE DECODER

With just 11,700 training data points and a single training run, the AI search agent OpenSeeker achieves results that rival solutions from Alibaba and others. Data, code, and model are all openly accessible.

Powerful AI search agents—systems that autonomously search the internet for information across multiple steps—have so far been big tech's territory. OpenAI, Google, and Alibaba keep their training data locked down. Even projects that publish their model weights stay quiet about the data behind them.

This data monopoly has held back the open research community for close to a year, according to researchers at Shanghai Jiao Tong University. With OpenSeeker, the academic team aims to change that: all training data (MIT license), the code, and the model weights are openly available.

Web link structures replace language model guesswork for training data

OpenSeeker builds on two core ideas for generating data. For question-answer pairs, the team uses the real link structure of the web as its foundation, generating questions from it. Starting from randomly selected seed pages within a web corpus (around 68 GB of English and 9 GB of Chinese data), the system follows hyperlinks to related pages and pulls out the most important information.

Specific names and terms then get swapped out for vague descriptions, so a search agent can't find the answer with a simple keyword search. This forces genuine multi-step search and reasoning.

OpenSeeker's training question pipeline showing the process from web corpus through graph expansion, entity extraction, question generation, and obfuscation to a two-stage verification filter for difficulty and solvability. — OpenSeeker's training question pipeline: from web corpus to graph expansion and obfuscation to double quality checking. | Image: Du et al.

A two-stage filter weeds out unusable questions: a strong base model must not be able to answer them without tools but must be able to solve them with full context. If either condition fails, the question gets tossed.

The second idea focuses on the search paths the model learns from. Web pages contain a lot of noise that drags down the quality of recorded solution paths. During data generation, a teacher model gets a cleaned-up summary of previous search results and makes better decisions based on that.

During training, the student model then sees the raw, uncleaned data but is still expected to reproduce the teacher's high-quality decisions. This forces it to figure out on its own how to separate signal from noise.

11,700 data points versus 147,000: data quality beats sheer volume

OpenSeeker is based on Qwen3-30B-A3B and was trained with just 11,700 data points in a single run using supervised fine-tuning, without any reinforcement learning or repeated tweaking.

According to the paper, the model hit 48.4 percent on the Chinese-language BrowseComp-ZH benchmark, beating Alibaba's Tongyi DeepResearch at 46.7 percent. Alibaba's model went through a three-stage process of extended pre-training, supervised fine-tuning, and reinforcement learning.

Bar chart comparing OpenSeeker across four benchmarks against proprietary and open-source models. OpenSeeker outperforms all fully open agents and beats Tongyi DeepResearch on BrowseComp-ZH. — OpenSeeker (orange) is the only fully open agent to achieve competitive results on all four benchmarks. | Image: Du et al.

On OpenAI's English-language BrowseComp, OpenSeeker scores 29.5 percent - nearly double DeepDive's 15.3 percent, the previous leader among fully open agents.

A comparison with MiroThinker drives home how much data quality matters over raw quantity: that model was fed 147,000 training examples yet only manages 13.8 percent on BrowseComp-ZH. OpenSeeker hits 3.5 times that score with less than one-twelfth the data.

Two histograms comparing token length and number of tool calls between OpenSeeker's Chinese training data and the BrowseComp-ZH benchmark. OpenSeeker's data is significantly longer and requires more search steps. — OpenSeeker's Chinese training data requires an average of 46 tool calls per task, compared to just 27 for BrowseComp-ZH. | Image: Du et al.

There's still a gap compared to the strongest proprietary systems, though. OpenAI's GPT-5-High hits 54.9 percent on BrowseComp, and DeepSeek-V3.2 with 671 billion parameters reaches 51.4 percent. OpenSeeker runs on a fraction of the model size and training effort.

The question of who gets access to high-quality training data has been a central issue in the AI industry for a while now. Last year, a research team released the Common Pile, an 8 TB text dataset built from openly licensed sources. So far, that hasn't done much to shake the dominance of commercial models.

AI News Without the Hype – Curated by Humans

Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section.

AI news without the hype
Curated by humans.

More than 16% discount.
Read without distractions – no Google ads.
Access to comments and community discussions.
Weekly AI newsletter.
6 times a year: “AI Radar” – deep dives on key AI topics.
Up to 25 % off on KI Pro online events.
Access to our full ten-year archive.
Get the latest AI news from The Decoder.

Subscribe to The Decoder

OpenSeeker's open-source approach aims to break up the data monopoly for AI search agents

Web link structures replace language model guesswork for training data

11,700 data points versus 147,000: data quality beats sheer volume

AI News Without the Hype – Curated by Humans

AI news without the hypeCurated by humans.

AI news without the hype
Curated by humans.