
A new review shows that while OpenAI was first to push reasoning-enabled language models into the spotlight, Deepseek-R1 has kicked research in this area into a higher gear.


Since its release about four months ago, Deepseek-R1 has attracted attention for delivering strong logical reasoning with far fewer training resources than earlier models. Its launch set off a flurry of replication efforts across the industry—Meta, for instance, reportedly formed special teams to study and mimic the model.

Researchers from an SEO agency and several universities in China and Singapore have now looked at how R1 has shifted the landscape. Their analysis suggests that, while OpenAI set the course, Deepseek-R1 played a major role in speeding up the recent surge of reasoning-focused language models.

Better data, better results

One key factor was supervised fine-tuning (SFT), where base models are retrained using carefully curated, step-by-step explanations. The meta-analysis found that quality matters more than sheer volume: A few thousand rigorously vetted examples can raise even 7B or 1.5B models to a high level, while millions of poorly filtered samples yield little improvement.
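As a rough illustration of what such a run involves, here is a minimal SFT sketch, assuming a small Hugging Face base model and a hand-vetted set of step-by-step solutions. The model name, example data, and hyperparameters are placeholders, not the setups examined in the review:

```python
# Minimal supervised fine-tuning sketch: retrain a small causal LM on a
# few thousand curated chain-of-thought examples. All names and values
# below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B"  # any small base model works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each example pairs a problem with a rigorously vetted step-by-step solution.
curated_examples = [
    {"prompt": "What is 17 * 24? Think step by step.",
     "solution": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408."},
    # ... a few thousand more, filtered for correctness ...
]

def encode(example):
    text = example["prompt"] + "\n" + example["solution"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(2):
    for example in curated_examples:
        batch = encode(example)
        # Standard next-token prediction: the labels are the inputs themselves.
        outputs = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```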


This challenges the older assumption that deep reasoning always requires massive models. The underlying architecture still sets the upper limits, but reasoning-oriented models can make more efficient use of those resources in some areas.

Reinforcement learning has also become more important for building reasoning skills. Two algorithms stand out: PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization). Both were around before Deepseek-R1, but the surge in interest has brought them into much wider use.

PPO tweaks the model’s weights step by step, but only enough to keep new strategies close to previous ones. A built-in clipping mechanism prevents major jumps and keeps training stable.
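The clipping idea fits in a few lines of code. Below is a sketch of PPO's clipped surrogate loss; the variable names and the 0.2 clipping range are common illustrative defaults, not tied to any specific implementation discussed in the review:

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped surrogate objective used by PPO.

    new_logprobs / old_logprobs: log-probabilities of the sampled tokens
    under the current and the previous policy; advantages: how much better
    each sample was than expected. All are 1-D tensors of equal length.
    """
    # Probability ratio between the new and the old policy.
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Unclipped and clipped policy-gradient terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the minimum of the two keeps each update close to the old policy.
    return -torch.min(unclipped, clipped).mean()
```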

GRPO takes this further by generating several answer options for each prompt, comparing their rewards within a group, and updating the model based on their relative scores. With group normalization, GRPO doesn’t need a separate value network and remains efficient, even with long, chain-of-thought responses.
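The group-relative part is similarly compact: sample several answers for one prompt, score them, and normalize each reward against the rest of the group instead of training a value network. The sketch below is a simplified illustration, not DeepSeek's exact implementation:

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt.

    rewards: 1-D tensor with one scalar reward per sampled answer.
    Each answer's advantage is its reward relative to the group mean,
    scaled by the group's standard deviation - no value network needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled answers to the same prompt, scored by a verifier.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
advantages = grpo_advantages(rewards)
# Answers that beat the group average get positive advantages and are
# reinforced; the loss can reuse the PPO-style clipped ratio shown above.
```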

New strategies in training

Researchers have been testing new approaches to training these models. One effective method is to start with shorter answers and gradually increase their length. Curriculum learning—where tasks get harder step by step—has also shown good results. According to the study, this suggests that AI models may learn in ways that resemble how people learn new skills.
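A length-based curriculum of this kind can be expressed as a simple schedule that raises the allowed response length as training progresses; the token limits and step counts below are made-up placeholders:

```python
def max_response_tokens(step, total_steps, start_len=256, end_len=4096):
    """Curriculum schedule: the permitted chain-of-thought length grows
    linearly from start_len to end_len over the course of training."""
    progress = min(step / total_steps, 1.0)
    return int(start_len + progress * (end_len - start_len))

# Early in training the model must solve tasks with short answers;
# later it is allowed to produce longer, more detailed reasoning.
for step in (0, 5_000, 10_000):
    print(step, max_response_tokens(step, total_steps=10_000))
```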


Another major trend is bringing reasoning skills into multimodal tasks. Early research has focused on transferring these abilities to image and audio analysis, and so far, reasoning developed in text models often carries over to other areas.

OpenAI's latest o3 model, for example, incorporates images and tool use directly into its reasoning process, something that wasn't available or highlighted when the model was first announced last December. Still, researchers say there's a lot of room for improvement.

Reasoning introduces new challenges

Better reasoning also means new challenges around safety and efficiency. Researchers have been working on ways to prevent unwanted behaviors like "overthinking".

One example: Microsoft's Phi-4 reasoning model reportedly generates over 50 "thoughts" just to answer a simple "Hi." An analysis by Artificial Analysis found that reasoning increases the token use of Google's Gemini 2.5 Flash model by a factor of 17, which drives up both computation and cost.
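For a rough sense of what a 17x token blow-up means in practice, here is a back-of-the-envelope calculation; the per-token price and baseline token count are hypothetical placeholders, not figures from Artificial Analysis:

```python
# Hypothetical numbers to illustrate the scaling, not measured values.
baseline_output_tokens = 500      # typical non-reasoning answer (placeholder)
reasoning_multiplier = 17         # token factor reported by Artificial Analysis
price_per_million_tokens = 0.60   # placeholder output price in USD

def cost(tokens):
    return tokens / 1_000_000 * price_per_million_tokens

print(f"standard:  {cost(baseline_output_tokens):.6f} USD")
print(f"reasoning: {cost(baseline_output_tokens * reasoning_multiplier):.6f} USD")
# Same request, same price per token - roughly 17x the output cost.
```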


While reasoning can improve the quality and safety of AI outputs, it can also mean much higher computational demands, rising costs, and sometimes inefficient behavior.

This makes it more important to choose the right tool for the job. For now, there’s no clear consensus on when to use a standard LLM and when to reach for a reasoning model—except for especially complex logic, science, or coding problems.

OpenAI recently published a guide for picking among its own models. The advice offers a starting point, but doesn’t fully settle when reasoning is the right choice. In practice, it depends on the context—and on balancing efficiency, cost, and how deep an answer you need.

Model | Core strength | Ideal first reach-for | Watch-outs | Escalate / Downgrade path
GPT-4o | Real-time voice / vision chat | Live multimodal agents | Slightly below 4.1 on text SOTA (state of the art) | Need deep reasoning → o4-mini
GPT-4.1 | 1M-token text accuracy king | Long-doc analytics, code review | Cannot natively reason; higher cost than minis | Tight budget → 4.1-mini / nano
o3 | Deep tool-using agent | High-stakes, multi-step reasoning | Latency & price | Cost/latency → o4-mini
o4-mini | Cheap, fast reasoning | High-volume "good-enough" logic | Depth ceiling vs o3 | Accuracy critical → o3

Safety is another major concern. Reasoning models may be harder to jailbreak thanks to their structured thinking process, but they also come with new risks: If the reasoning logic is manipulated, these systems can still be tricked into producing harmful or problematic outputs—even when safeguards are in place. As a result, jailbreaking attacks remain an ongoing challenge.

The study concludes that Deepseek-R1 has played a key role in speeding up the development of reasoning language models. The authors see these advances as just the beginning, with the next phase focused on expanding reasoning to new applications, improving reliability, and finding even more efficient ways to train these systems.

Summary
  • A new review study finds that the release of Deepseek-R1 has significantly sped up the progress of language models with reasoning skills, sparking advancements in training methods, multimodal abilities, and security.
  • The review points out that supervised fine-tuning with carefully selected datasets, along with growing use of reinforcement learning techniques like PPO and GRPO, has made it possible for even smaller models to efficiently reach strong reasoning performance.
  • The study also notes emerging issues, including much higher resource use and new security risks tied to advanced reasoning, but stresses the strong potential for future research in this area.