
Microsoft has added two new models to its Phi small language model family: Phi-4-multimodal, which can handle audio, images and text simultaneously, and Phi-4-mini, a streamlined model focused on text processing.


Phi-4-multimodal introduces what Microsoft calls a "mixture-of-LoRAs" approach. This allows it to process text, audio, and visual input in a single representation space, eliminating the need for complex pipelines or separate models for different types of data.
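To illustrate the idea, here is a minimal, purely conceptual sketch of such a layer in PyTorch: a shared base weight stays frozen while each modality routes through its own low-rank adapter. All class, parameter, and modality names here are illustrative assumptions, not Microsoft's actual implementation.

```python
# Conceptual sketch of a mixture-of-LoRAs layer (illustrative only; not
# Microsoft's actual implementation). A frozen base weight is shared, and
# each modality routes through its own low-rank adapter.
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)  # project to low rank
        self.up = nn.Linear(rank, dim, bias=False)    # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

class MixtureOfLoRAsLinear(nn.Module):
    """Frozen shared weight plus one LoRA adapter per modality."""
    def __init__(self, dim: int, modalities=("text", "audio", "vision")):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():
            p.requires_grad_(False)  # base model stays frozen
        self.adapters = nn.ModuleDict({m: LoRAAdapter(dim) for m in modalities})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Base output is shared; only the modality-specific adapter differs.
        return self.base(x) + self.adapters[modality](x)

layer = MixtureOfLoRAsLinear(dim=512)
audio_features = torch.randn(1, 10, 512)
out = layer(audio_features, modality="audio")  # same base weights, audio adapter
```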

The model has taken the top spot on the Hugging Face OpenASR Leaderboard for automatic speech recognition with a word error rate of only 6.14 percent. This puts it ahead of specialized models such as WhisperV3 and SeamlessM4T-v2-Large. According to Microsoft, the model also excels in speech translation and summarization.
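Word error rate is the number of word-level substitutions, insertions, and deletions divided by the length of the reference transcript, so 6.14 percent corresponds to roughly six errors per 100 reference words. A minimal sketch of the standard edit-distance computation:

```python
# Minimal word error rate (WER) sketch: edit distance over words divided
# by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.333
```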

Despite its relatively compact size, Phi-4-multimodal also shows good performance on vision-related tasks, especially when it comes to mathematical and scientific reasoning, Microsoft says. The company reports that it rivals larger models such as Gemini-2-Flash-lite-preview and Claude-3.5-Sonnet in document and graph comprehension, optical character recognition, and visual scientific reasoning.

[Image: Benchmark comparison of five models on AI2D, ChartQA, DocVQA, and InfoVQA.]
Microsoft's Phi-4-multimodal-instruct outperforms competing models in visual comprehension tasks. The benchmarks measure the ability to understand diagrams (AI2D), analyze charts (ChartQA), interpret documents (DocVQA), and extract information from visual content (InfoVQA), with Phi-4 as the clear leader in DocVQA (87.3 percent). | Image: Microsoft

Phi-4-mini: Focused on efficiency

The new Phi-4-mini comes in at 3.8 billion parameters and is built as a dense decoder-only transformer with a 128K context window. Microsoft has optimized it specifically for speed and efficiency.

One of Phi-4-mini's standout features is its function calling capability, which allows it to automatically use tools. The model can analyze queries, call relevant functions with appropriate parameters, receive results, and incorporate them into its responses using a standardized protocol. This makes it well suited for agent-based systems that need to connect to external tools, APIs, and data sources, Microsoft says.
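Microsoft does not publish the exact wire format in its announcement, but the typical loop looks like the hedged sketch below: the model emits a structured call, the host executes it, and the result is fed back so the model can compose the final answer. The tool name and JSON shape here are illustrative assumptions.

```python
# Illustrative function-calling loop (hypothetical names; the exact tool
# protocol Phi-4-mini expects may differ). The model emits a structured
# call, the host runs it, and the result goes back into the conversation.
import json

def get_stock_price(ticker: str) -> dict:
    """Hypothetical external tool the model can call."""
    return {"ticker": ticker, "price": 428.15}

TOOLS = {"get_stock_price": get_stock_price}

# 1. Given the tool schema, the model answers with a structured call
#    instead of plain text (the shape shown here is a common convention).
model_output = '{"name": "get_stock_price", "arguments": {"ticker": "MSFT"}}'

# 2. The host parses the call and dispatches to the real function ...
call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])

# 3. ... then feeds the result back so the model can write the final answer.
followup_message = {"role": "tool", "content": json.dumps(result)}
print(followup_message)
```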

[Image: Bar chart comparing ten language models across seven benchmarks, from MMLU-Pro to HumanEval.]
Microsoft's Phi-4-mini performs well for its size in various benchmark tests, especially GSM8K (89.8 percent) and Math. In some cases, the compact model outperforms much larger competitors such as Llama and Mistral. | Image: Microsoft

Real-world applications

Microsoft sees multiple applications for both models. Phi-4-multimodal could be integrated into smartphones to handle voice commands and image analysis, while also helping to improve driver assistance systems in cars. For Phi-4-mini, Microsoft sees particular value in financial services applications. It could support tasks such as calculations, report generation, and translation of financial documents.

Both models are available via Azure AI Foundry, Hugging Face and the NVIDIA API Catalog. These new additions join the Phi-4-14B language model, which Microsoft introduced in December 2024 and released with weights in January.
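On Hugging Face, loading should follow the usual transformers workflow. A minimal sketch, assuming the model is published under the repo ID microsoft/Phi-4-mini-instruct (check the model card for the exact ID and recommended generation settings):

```python
# Minimal sketch: loading Phi-4-mini from Hugging Face with transformers.
# The repo ID "microsoft/Phi-4-mini-instruct" is an assumption based on the
# hub listing; consult the model card for the exact ID and settings.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",
    device_map="auto",   # place weights on available GPU/CPU
    torch_dtype="auto",  # use the dtype the checkpoint recommends
)

messages = [{"role": "user", "content": "Summarize what a 128K context window means."}]
print(generator(messages, max_new_tokens=128)[0]["generated_text"])
```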

Summary
  • Microsoft is adding two new small language models to its Phi family: Phi-4-multimodal processes speech, images, and text simultaneously, while Phi-4-mini specializes in text-based tasks with 3.8 billion parameters.
  • Phi-4-multimodal uses a mixture-of-LoRAs approach to simultaneously process different types of input and, according to Microsoft, outperforms specialized models such as WhisperV3 in automatic speech recognition with a word error rate of 6.14 percent.
  • Phi-4-mini supports function calling for the use of external tools and can process sequences with up to 128,000 tokens. Both models are available via Azure AI Foundry, Hugging Face and the NVIDIA API Catalog.
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.