
Microsoft is expanding its Phi series of compact language models with three new variants designed for advanced reasoning tasks.

The new models—Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning—are optimized to handle complex problems through structured reasoning and internal reflection, while remaining lightweight enough to run on lower-end hardware, including mobile devices.

High efficiency with fewer parameters

Phi-4-reasoning contains 14 billion parameters and was trained via supervised fine-tuning using reasoning paths from OpenAI's o3-mini. A more advanced version, Phi-4-reasoning-plus, adds reinforcement learning and processes 1.5 times more tokens than the base model. This results in higher accuracy, but also increases response time and computational cost.

Bar chart: Comparison of the reasoning performance of Phi-4 (14B) with DeepSeek (70B, 671B) and o1-mini on AIME, HMMT, OmniMath, and GPQA.
Despite having just 14 billion parameters, the Phi-4 reasoning models match or outperform significantly larger systems, including the 70B-parameter DeepSeek-R1-Distill-Llama. | Image: Microsoft

According to Microsoft, both models outperform larger language models such as OpenAI's o1-mini and DeepSeek-R1-Distill-Llama-70B—even though the latter is five times bigger. On the AIME-2025 benchmark, a qualifier for the U.S. Mathematical Olympiad, the Phi models also surpass DeepSeek-R1, which has 671 billion parameters.

Performance gains are not limited to math or science domains. Microsoft says the models also show strong results in programming, algorithmic problem-solving, and planning tasks. Improvements in logical reasoning are said to carry over to more general capabilities as well, such as instruction following and answering questions about long-form content. "We observe a non-trivial transfer of improvements to general-purpose benchmarks as well," the researchers write.

Bar chart: Accuracy comparison of Phi-4, GPT-4o, and o3-mini on benchmarks such as FlenQA, IFEval, HumanEvalPlus, and MMLUPro.
On benchmarks such as HumanEvalPlus (coding) and MMLUPro (language understanding), Phi-4 reasoning models perform competitively with larger models such as GPT-4o and o3-mini. | Image: Microsoft

Phi-4-mini-reasoning brings reasoning to mobile

The smallest of the three models, Phi-4-mini-reasoning, is designed for mobile and embedded applications such as educational tools and tutoring systems. It uses a 3.8-billion-parameter architecture and was trained on more than a million math problems, ranging from middle school to postgraduate levels.

Bar chart: Performance of Phi-4-mini-reasoning (3.8B) vs. larger models on math benchmarks (AIME 24, MATH-500, GPQA Diamond).
With 3.8 billion parameters, Phi-4-mini-reasoning outperforms its base model—and even larger competitors—on math benchmarks. | Image: Microsoft

Despite its smaller size, Phi-4-mini-reasoning surpasses models like OpenThinker-7B and DeepSeek-R1-Distill-Qwen-7B in several evaluations. In mathematical problem-solving, its results match or exceed those of OpenAI's o1-mini.

Optimized for Windows integration

Microsoft says the new models are already optimized for use on Windows systems. A variant called Phi Silica is deployed on Copilot+ PCs. The model is integrated into tools like Outlook for offline summarization and the "Click to Do" feature, which provides contextual text functions directly on the screen. It runs directly on neural processing units (NPUs), enabling faster responses and lower power usage.

All three models—Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning—are available with open weights on both Azure AI Foundry and Hugging Face.
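
Because the weights are open, the models can be loaded with standard tooling. The snippet below is a minimal sketch using the Hugging Face transformers library; the checkpoint name and the math prompt are illustrative assumptions, so verify the exact model IDs and license terms on the Hub before running.

```python
# Minimal sketch: running a Phi-4 reasoning model with Hugging Face transformers.
# Assumed checkpoint names: "microsoft/Phi-4-reasoning",
# "microsoft/Phi-4-reasoning-plus", "microsoft/Phi-4-mini-reasoning".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-reasoning"  # smallest of the three variants

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Reasoning models spend tokens on intermediate "thinking" before the final
# answer, so allow a generous generation budget.
messages = [{"role": "user", "content": "If 3x + 7 = 25, what is x?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```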

Summary
  • Microsoft has released three compact language models in the Phi series—Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning—aimed at handling reasoning tasks on devices with limited computing power.
  • The company states that the 14-billion-parameter models (Phi-4-reasoning and Phi-4-reasoning-plus) and the smaller 3.8-billion-parameter Phi-4-mini-reasoning at times outperform larger models such as OpenAI's o1-mini and DeepSeek-R1 in benchmark tests, including on mathematical problems.
  • All three models are available with open weights on both Azure AI Foundry and Hugging Face.