Alibaba’s research lab Tongyi has introduced ZeroSearch, a new method for training large language models to handle search tasks—without relying on real web searches.
For chatbots to answer questions accurately, especially when their built-in knowledge isn’t enough, they need to learn how to find information on the fly. Most current approaches use reinforcement learning (RL) and depend on actual search engines like Google to teach this skill. But according to Alibaba’s team, this is expensive, hard to control, and doesn’t scale well.
ZeroSearch takes a different approach: instead of using real web searches during training, it simulates the search process with a second language model. This model generates short texts in response to search queries, providing either relevant or intentionally irrelevant information—mimicking real search results, but under full control of the researchers.
Three-stage search simulation
The Qwen-2.5 language model, which is the main model being trained, goes through a structured learning process. In each round, it decides whether it needs to search for more information. If so, it crafts a query and sends it to the simulation model. The Qwen-2.5 model then reads the generated documents and produces an answer, which is scored and fed back to the model as a reinforcement-learning reward.
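The loop described above can be sketched in a few lines. This is a hypothetical, heavily simplified illustration, not ZeroSearch's actual code: the function names (`simulate_search`, `rollout`) and the toy knowledge base are invented here, and stub functions stand in for the two language models.

```python
def simulate_search(query, helpful=True):
    """Stand-in for the simulation LLM: returns a short 'document'
    that is either relevant or deliberately unhelpful."""
    # Toy knowledge base, invented for this sketch.
    kb = {
        "voice of smokey the bear": "Sam Elliott voices Smokey the Bear.",
        "sam elliott spouse": "Sam Elliott is married to Katharine Ross.",
    }
    if helpful and query.lower() in kb:
        return kb[query.lower()]
    return "No relevant information found."

def rollout(question, queries, helpful=True):
    """One training rollout: the policy issues queries, collects the
    simulated documents, and answers from the accumulated context.
    (In the real system an RL reward would then score the answer.)"""
    context = [question]
    for q in queries:
        context.append(simulate_search(q, helpful))
    return context

ctx = rollout(
    "Who is the spouse of the person who voices Smokey the Bear?",
    ["voice of smokey the bear", "sam elliott spouse"],
)
```

In the actual system, both the query-crafting step and the answer step are performed by the trained model itself; here the queries are hard-coded to keep the sketch self-contained.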
At the start of training, the simulated search results are intentionally helpful. Over time, the quality is gradually reduced—a curriculum learning approach. This helps the model learn to draw useful conclusions even from unclear or conflicting information, much like searching the real internet.
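A curriculum like this can be expressed as a simple schedule for how often the simulation model is asked to produce an unhelpful document. The linear ramp below is an assumption for illustration; the schedule ZeroSearch actually uses may differ.

```python
def noise_probability(step, total_steps, p_start=0.0, p_end=0.5):
    """Hypothetical linear curriculum: the chance that a simulated
    search result is deliberately irrelevant grows as training
    progresses, so early rollouts see clean documents and later
    rollouts must cope with noise."""
    frac = min(step / total_steps, 1.0)  # clamp after the ramp ends
    return p_start + frac * (p_end - p_start)
```

At each rollout, the trainer would draw a random number and, with this probability, flip the simulation model into its "useless document" mode.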
The simulation model itself is fine-tuned beforehand, learning to generate both “useful” and “useless” search results. This distinction is controlled with subtle changes to the prompts—the instructions given to the model.
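The idea of steering result quality through a small prompt change can be sketched as follows. The wording of the template is invented for this example; the paper's actual fine-tuning prompts are not reproduced here.

```python
# Hypothetical prompt template: one word toggles whether the
# simulation model should produce a relevant or an irrelevant document.
PROMPT = (
    "Given the query '{query}', write a short document that is "
    "{quality} for answering it."
)

def build_prompt(query, helpful=True):
    quality = "useful" if helpful else "useless"
    return PROMPT.format(query=query, quality=quality)
```

Because the two modes differ only in this small cue, the same fine-tuned model can serve as both the "good search engine" and the "noisy search engine" during training.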
Successfully managing multi-step searches
Test runs show that the model can handle complex, multi-step search processes. In one example, it was asked, "Who is the spouse of the person who voices Smokey the Bear?" The simulated search first identified Sam Elliott as the voice actor. The model then conducted a second simulated search for Sam Elliott’s spouse, finding Katharine Ross. It combined both pieces of information correctly and produced an accurate answer.
This ability to break down a question into sub-questions and build on intermediate results is a key goal of ZeroSearch training.
Significant cost savings and full control
Simulating the search process not only removes dependency on external search services, but also cuts costs dramatically. In experiments, running 64,000 searches through Google’s SerpAPI cost about $586 in API fees. By contrast, using the simulation model on four rented AWS A100 GPUs cost just $71 in compute time.
Another benefit: the simulated search is always available, produces responses in a consistent style, and can be made harder or easier as needed. According to the team, this makes training more predictable and robust.
Outperforming Google searches in training
The team evaluated ZeroSearch on seven well-known question-answering benchmarks, including Natural Questions, TriviaQA, and HotpotQA. It matched or outperformed approaches trained with real Google searches, especially when using a large simulation model with 14 billion parameters.
Smaller models with 7 billion parameters also performed well. The key wasn’t just size, but whether the simulation model had been specifically fine-tuned for the task—models only controlled by prompts did much worse.
Alibaba has released some of its models on Hugging Face. More details and the code are available on GitHub.