OpenSeeker's open-source approach aims to break up the data monopoly for AI search agents
Key Points
- OpenSeeker is a fully open AI search agent that autonomously performs multi-level web searches, with all training data, code, and model weights publicly available, aiming to break the data monopoly held by major corporations like OpenAI or Alibaba.
- The model is trained on questions derived from the web's link structure, intentionally designed so that simple lookups won't suffice. A teacher-student process teaches it to extract relevant information from noisy raw data.
- Despite requiring minimal training resources, OpenSeeker performs on par with far more resource-intensive systems from large providers, though it still falls short of the strongest proprietary models.
With just 11,700 training data points and a single training run, the AI search agent OpenSeeker achieves results that rival solutions from Alibaba and others. Data, code, and model are all openly accessible.
Powerful AI search agents—systems that autonomously search the internet for information across multiple steps—have so far been big tech's territory. OpenAI, Google, and Alibaba keep their training data locked down. Even projects that publish their model weights stay quiet about the data behind them.
This data monopoly has held back the open research community for close to a year, according to researchers at Shanghai Jiao Tong University. With OpenSeeker, the academic team aims to change that: all training data (MIT license), the code, and the model weights are openly available.
Web link structures replace language model guesswork for training data
OpenSeeker builds on two core ideas for generating data. For question-answer pairs, the team uses the real link structure of the web as its foundation, generating questions from it. Starting from randomly selected seed pages within a web corpus (around 68 GB of English and 9 GB of Chinese data), the system follows hyperlinks to related pages and pulls out the most important information.
Specific names and terms then get swapped out for vague descriptions, so a search agent can't find the answer with a simple keyword search. This forces genuine multi-step search and reasoning.

A two-stage filter weeds out unusable questions: a strong base model must not be able to answer them without tools but must be able to solve them with full context. If either condition fails, the question gets tossed.
The second idea focuses on the search paths the model learns from. Web pages contain a lot of noise that drags down the quality of recorded solution paths. During data generation, a teacher model gets a cleaned-up summary of previous search results and makes better decisions based on that.
During training, the student model then sees the raw, uncleaned data but is still expected to reproduce the teacher's high-quality decisions. This forces it to figure out on its own how to separate signal from noise.
11,700 data points versus 147,000: data quality beats sheer volume
OpenSeeker is based on Qwen3-30B-A3B and was trained with just 11,700 data points in a single run using supervised fine-tuning, without any reinforcement learning or repeated tweaking.
According to the paper, the model hit 48.4 percent on the Chinese-language BrowseComp-ZH benchmark, beating Alibaba's Tongyi DeepResearch at 46.7 percent. Alibaba's model went through a three-stage process of extended pre-training, supervised fine-tuning, and reinforcement learning.

On OpenAI's English-language BrowseComp, OpenSeeker scores 29.5 percent - nearly double DeepDive's 15.3 percent, the previous leader among fully open agents.
A comparison with MiroThinker drives home how much data quality matters over raw quantity: that model was fed 147,000 training examples yet only manages 13.8 percent on BrowseComp-ZH. OpenSeeker hits 3.5 times that score with less than one-twelfth the data.

There's still a gap compared to the strongest proprietary systems, though. OpenAI's GPT-5-High hits 54.9 percent on BrowseComp, and DeepSeek-V3.2 with 671 billion parameters reaches 51.4 percent. OpenSeeker runs on a fraction of the model size and training effort.
The question of who gets access to high-quality training data has been a central issue in the AI industry for a while now. Last year, a research team released the Common Pile, an 8 TB text dataset built from openly licensed sources. So far, that hasn't done much to shake the dominance of commercial models.
AI News Without the Hype – Curated by Humans
As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.
Subscribe now