
Researchers from the National University of Singapore, Princeton, and the University of Illinois Urbana-Champaign have pinpointed three key factors that make AI agents smarter: data quality, algorithm design, and reasoning strategy.


Their work shows that a carefully trained 4-billion-parameter model can match or even beat competitors with up to 32 billion parameters.

The team broke down what makes AI agents excel by analyzing the impact of data quality (left), training algorithms (middle), and reasoning modes (right). | Image: Yu et al

Real data beats synthetic every time

The type of data used during training matters most. The researchers compared models trained on authentic learning trajectories with models trained on synthetic data, in which the intermediate reasoning steps were replaced by raw tool outputs.

On AIME math benchmarks, a 4-billion-parameter model trained on real data achieved 29.79 percent accuracy. The same model trained on synthetic data scored under 10 percent.

Training with real, continuous learning data leads to significantly higher accuracy than synthetic alternatives. | Image: Yu et al.

Real data captures the full reasoning workflow, including pre-tool analysis, guided execution, error correction, and self-reflection. Synthetic data can't replicate these connections between steps, the researchers say.

Data diversity proved equally crucial. A mixed dataset of 30,000 examples from math, science, and programming accelerated learning dramatically. The AI hit 50 percent accuracy after just 150 training steps, while a math-only dataset needed 220 steps to reach the same benchmark.
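The mixing step itself is simple. A minimal sketch of assembling a diverse training set (the dataset contents and the even three-way split are placeholders for illustration, not the authors' actual data pipeline):

```python
import random

# Illustrative domain pools; in practice these would be real
# math, science, and programming problems with solutions.
math_data = [{"domain": "math", "id": i} for i in range(10_000)]
science_data = [{"domain": "science", "id": i} for i in range(10_000)]
code_data = [{"domain": "code", "id": i} for i in range(10_000)]

# Combine and shuffle so every training batch mixes domains,
# rather than seeing long single-domain runs.
mixed = math_data + science_data + code_data
random.seed(0)
random.shuffle(mixed)

print(len(mixed))  # 30000 examples total
```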

Token-based scoring makes the difference

The second factor is how the learning process itself is structured. The team tested three algorithm variants, looking for the best way to optimize performance. The winner was a method called GRPO-TCR, which combines token-level scoring (grading each word chunk), broader clipping for more exploration, and a reward setup to discourage overly long answers.

The GRPO-TCR approach (orange) consistently outperforms both the sentence-based method (green) and the basic algorithm (blue) in these benchmarks. | Image: Yu et al.

This optimized approach achieved 70.93 percent accuracy on one math benchmark and 68.13 percent on another. Token-based scoring proved particularly powerful, outperforming sentence-based methods by about 4 percent. Unlike traditional reinforcement learning setups, this lets agents improve exploration and precision simultaneously through their tool interactions.
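The paper's exact loss isn't reproduced here, but the ingredients it names can be sketched. In the snippet below, `group_relative_advantages` follows the standard GRPO recipe (normalize each rollout's reward against its sampling group), `token_level_objective` applies the clipped surrogate per token with an asymmetric, wider upper clip for more exploration, and `overlong_penalty` is one common way to discourage overly long answers. All function names and constants are illustrative assumptions, not the authors' implementation.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantage: reward minus group mean, scaled by group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

def token_level_objective(log_ratios, advantage, clip_low=0.2, clip_high=0.28):
    """Clipped surrogate applied per token ("token-level scoring").
    The upper clip is wider than the lower one, allowing more
    exploration on tokens the new policy favors."""
    total = 0.0
    for lr in log_ratios:
        ratio = math.exp(lr)  # new-policy prob / old-policy prob
        clipped = max(1 - clip_low, min(ratio, 1 + clip_high))
        total += min(ratio * advantage, clipped * advantage)
    return total / len(log_ratios)  # average over tokens, not sequences

def overlong_penalty(length, max_len=4096, buffer=512):
    """Soft penalty on response length: zero inside the budget,
    then decaying linearly to -1 as the hard limit is reached."""
    if length <= max_len - buffer:
        return 0.0
    if length >= max_len:
        return -1.0
    return (max_len - buffer - length) / buffer
```

For example, a group of four rollouts scoring `[1, 0, 1, 0]` yields advantages `[1, -1, 1, -1]`: correct answers are pushed up relative to their own group, with no learned value model required.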

Think more, act less

The third finding is about how the AI organizes its reasoning. The researchers found two main styles: reactive (short thinking, frequent tool use) and deliberative (longer thinking, fewer tool calls). Models using the deliberative strategy consistently achieved over 70 percent success rates in tool use. Reactive models performed poorly, as their rapid-fire tool calls were often ineffective or wrong. Quality trumps quantity every time.
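The two styles can be told apart from simple trajectory statistics. A toy heuristic (the threshold values are invented for illustration and do not come from the paper):

```python
def reasoning_style(avg_tokens_per_thought, tool_calls_per_task,
                    token_threshold=512, call_threshold=5):
    """Label a trajectory: deliberative agents write long reasoning
    spans and call tools sparingly; reactive agents do the opposite."""
    if (avg_tokens_per_thought >= token_threshold
            and tool_calls_per_task <= call_threshold):
        return "deliberative"
    return "reactive"

print(reasoning_style(800, 2))   # deliberative
print(reasoning_style(100, 12))  # reactive
```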

The baseline algorithm (blue) tends to be reactive, making lots of tool calls and giving short answers. The more advanced algorithms take a slower, more thoughtful approach with fewer tool calls and longer responses. | Image: Yu et al.

Interestingly, current long-chain-of-thought models struggle with tool integration. Despite being optimized for extended thinking, they tend to avoid tool calls entirely, relying only on internal reasoning processes.

Small but mighty

Applying these insights, the researchers built DemyAgent-4B with just 4 billion parameters. The results speak for themselves: 72.6 percent on AIME2024, 70 percent on AIME2025, 58.5 percent on GPQA-Diamond science tests, and 26.8 percent on LiveCodeBench-v6 programming benchmarks. This performance places it firmly among competitors with 14 to 32 billion parameters, proving that smart training beats brute force.

DemyAgent-4B (just 4 billion parameters) holds its own against much larger models like rStar2-Agent-14B and ReTool-32B in math, science, and code benchmarks. | Image: Yu et al.

The researchers have released their training data and model weights for others to use and build on.

Summary
  • Researchers found that high-quality, diverse data and well-designed task distribution are key for effective AI training, while algorithm design and the right mindset also play a major role.
  • New training algorithms using token-based scoring, optimized rewards, and special exploration strategies led to more stable learning and higher success rates compared to traditional methods.
  • The DemyAgent-4B model, with 4 billion parameters, matched or outperformed much larger models with up to 32 billion parameters on several benchmarks, and both the models and datasets are available for public use.