
A new review shows that while OpenAI was first to push reasoning-enabled language models into the spotlight, Deepseek-R1 has kicked research in this area into a higher gear.


Since its release about four months ago, Deepseek-R1 has attracted attention for delivering strong logical reasoning with far fewer training resources than earlier models. Its launch set off a flurry of replication efforts across the industry—Meta, for instance, reportedly formed special teams to study and mimic the model.

Researchers from an SEO agency and several universities in China and Singapore have now looked at how R1 has shifted the landscape. Their analysis suggests that, while OpenAI set the course, Deepseek-R1 played a major role in speeding up the recent surge of reasoning-focused language models.

Better data, better results

One key factor was supervised fine-tuning (SFT), where base models are retrained using carefully curated, step-by-step explanations. The meta-analysis found that quality matters more than sheer volume: A few thousand rigorously vetted examples can raise even 7B or 1.5B models to a high level, while millions of poorly filtered samples yield little improvement.
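As a rough illustration of what such a run involves, here is a minimal SFT sketch, assuming a small Hugging Face base model and a hand-vetted set of step-by-step solutions. The model name, example data, and hyperparameters are placeholders, not the setups examined in the review:

```python
# Minimal supervised fine-tuning sketch: retrain a small causal LM on a
# few thousand curated chain-of-thought examples. All names and values
# below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B"  # any small base model works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each example pairs a problem with a rigorously vetted step-by-step solution.
curated_examples = [
    {"prompt": "What is 17 * 24? Think step by step.",
     "solution": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408."},
    # ... a few thousand more, filtered for correctness ...
]

def encode(example):
    text = example["prompt"] + "\n" + example["solution"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(2):
    for example in curated_examples:
        batch = encode(example)
        # Standard next-token prediction: the labels are the inputs themselves.
        outputs = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```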


This challenges the older assumption that deep reasoning always requires massive models. The underlying architecture still sets the upper limits, but reasoning-oriented models can make more efficient use of those resources in some areas.

Reinforcement learning has also become more important for building reasoning skills. Two algorithms stand out: PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization). Both were around before Deepseek-R1, but the surge in interest has brought them into much wider use.

PPO tweaks the model’s weights step by step, but only enough to keep new strategies close to previous ones. A built-in clipping mechanism prevents major jumps and keeps training stable.
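The clipping idea fits in a few lines of code. Below is a sketch of PPO's clipped surrogate loss; the variable names and the 0.2 clipping range are common illustrative defaults, not tied to any specific implementation discussed in the review:

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped surrogate objective used by PPO.

    new_logprobs / old_logprobs: log-probabilities of the sampled tokens
    under the current and the previous policy; advantages: how much better
    each sample was than expected. All are 1-D tensors of equal length.
    """
    # Probability ratio between the new and the old policy.
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Unclipped and clipped policy-gradient terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the minimum of the two keeps each update close to the old policy.
    return -torch.min(unclipped, clipped).mean()
```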

GRPO takes this further by generating several answer options for each prompt, comparing their rewards within a group, and updating the model based on their relative scores. With group normalization, GRPO doesn’t need a separate value network and remains efficient, even with long, chain-of-thought responses.
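The group-relative part is similarly compact: sample several answers for one prompt, score them, and normalize each reward against the rest of the group instead of training a value network. The sketch below is a simplified illustration, not DeepSeek's exact implementation:

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt.

    rewards: 1-D tensor with one scalar reward per sampled answer.
    Each answer's advantage is its reward relative to the group mean,
    scaled by the group's standard deviation - no value network needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled answers to the same prompt, scored by a verifier.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
advantages = grpo_advantages(rewards)
# Answers that beat the group average get positive advantages and are
# reinforced; the loss can reuse the PPO-style clipped ratio shown above.
```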

New strategies in training

Researchers have been testing new approaches to training these models. One effective method is to start with shorter answers and gradually increase their length. Curriculum learning—where tasks get harder step by step—has also shown good results. According to the study, this suggests that AI models may learn in ways that resemble how people learn new skills.
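A length-based curriculum of this kind can be expressed as a simple schedule that raises the allowed response length as training progresses; the token limits and step counts below are made-up placeholders:

```python
def max_response_tokens(step, total_steps, start_len=256, end_len=4096):
    """Curriculum schedule: the permitted chain-of-thought length grows
    linearly from start_len to end_len over the course of training."""
    progress = min(step / total_steps, 1.0)
    return int(start_len + progress * (end_len - start_len))

# Early in training the model must solve tasks with short answers;
# later it is allowed to produce longer, more detailed reasoning.
for step in (0, 5_000, 10_000):
    print(step, max_response_tokens(step, total_steps=10_000))
```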


Another major trend is bringing reasoning skills into multimodal tasks. Early research has focused on transferring these abilities to image and audio analysis, and so far, reasoning developed in text models often carries over to other areas.

OpenAI's latest o3 model, for example, incorporates images and tool use directly into its reasoning process, something that wasn't available or highlighted when the model was first announced last December. Still, researchers say there's a lot of room for improvement.

Reasoning introduces new challenges

Better reasoning also means new challenges around safety and efficiency. Researchers have been working on ways to prevent unwanted behaviors like "overthinking".

One example: Microsoft's Phi-4 reasoning model reportedly generates over 50 "thoughts" just to answer a simple "Hi." An analysis by Artificial Analysis found that reasoning increases the token use of Google's Gemini 2.5 Flash model by a factor of 17, which drives up both computation and cost.
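For a rough sense of what a 17x token blow-up means in practice, here is a back-of-the-envelope calculation; the per-token price and baseline token count are hypothetical placeholders, not figures from Artificial Analysis:

```python
# Hypothetical numbers to illustrate the scaling, not measured values.
baseline_output_tokens = 500      # typical non-reasoning answer (placeholder)
reasoning_multiplier = 17         # token factor reported by Artificial Analysis
price_per_million_tokens = 0.60   # placeholder output price in USD

def cost(tokens):
    return tokens / 1_000_000 * price_per_million_tokens

print(f"standard:  {cost(baseline_output_tokens):.6f} USD")
print(f"reasoning: {cost(baseline_output_tokens * reasoning_multiplier):.6f} USD")
# Same request, same price per token - roughly 17x the output cost.
```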


While reasoning can improve the quality and safety of AI outputs, it can also mean much higher computational demands, rising costs, and sometimes inefficient behavior.

This makes it more important to choose the right tool for the job. For now, there’s no clear consensus on when to use a standard LLM and when to reach for a reasoning model—except for especially complex logic, science, or coding problems.

OpenAI recently published a guide for picking among its own models. The advice offers a starting point, but doesn’t fully settle when reasoning is the right choice. In practice, it depends on the context—and on balancing efficiency, cost, and how deep an answer you need.

Model | Core strength | Ideal first reach-for | Watch-outs | Escalate / Downgrade path
GPT-4o | Real-time voice / vision chat | Live multimodal agents | Slightly below 4.1 on text SOTA (state of the art) | Need deep reasoning → o4-mini
GPT-4.1 | 1M-token text accuracy king | Long-doc analytics, code review | Cannot natively reason; higher cost than minis | Tight budget → 4.1-mini / nano
o3 | Deep tool-using agent | High-stakes, multi-step reasoning | Latency & price | Cost/latency → o4-mini
o4-mini | Cheap, fast reasoning | High-volume "good-enough" logic | Depth ceiling vs o3 | Accuracy critical → o3

Safety is another major concern. Reasoning models may be harder to jailbreak thanks to their structured thinking process, but they also come with new risks: If the reasoning logic is manipulated, these systems can still be tricked into producing harmful or problematic outputs—even when safeguards are in place. As a result, jailbreaking attacks remain an ongoing challenge.

The study concludes that Deepseek-R1 has played a key role in speeding up the development of reasoning language models. The authors see these advances as just the beginning, with the next phase focused on expanding reasoning to new applications, improving reliability, and finding even more efficient ways to train these systems.

Summary
  • A new review study finds that the release of Deepseek-R1 has significantly sped up the progress of language models with reasoning skills, sparking advancements in training methods, multimodal abilities, and security.
  • The review points out that supervised fine-tuning with carefully selected datasets, along with growing use of reinforcement learning techniques like PPO and GRPO, has made it possible for even smaller models to efficiently reach strong reasoning performance.
  • The study also notes emerging issues, including much higher resource use and new security risks tied to advanced reasoning, but stresses the strong potential for future research in this area.