
Microsoft has added two new models to its Phi small language model family: Phi-4-multimodal, which can handle audio, images and text simultaneously, and Phi-4-mini, a streamlined model focused on text processing.


Phi-4-multimodal introduces what Microsoft calls a "mixture-of-LoRAs" approach. This allows it to process text, audio, and visual input in a single representation space, eliminating the need for complex pipelines or separate models for different types of data.
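To illustrate the idea, here is a minimal, purely conceptual sketch of such a layer in PyTorch: a shared base weight stays frozen while each modality routes through its own low-rank adapter. All class, parameter, and modality names here are illustrative assumptions, not Microsoft's actual implementation.

```python
# Conceptual sketch of a mixture-of-LoRAs layer (illustrative only; not
# Microsoft's actual implementation). A frozen base weight is shared, and
# each modality routes through its own low-rank adapter.
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)  # project to low rank
        self.up = nn.Linear(rank, dim, bias=False)    # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

class MixtureOfLoRAsLinear(nn.Module):
    """Frozen shared weight plus one LoRA adapter per modality."""
    def __init__(self, dim: int, modalities=("text", "audio", "vision")):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():
            p.requires_grad_(False)  # base model stays frozen
        self.adapters = nn.ModuleDict({m: LoRAAdapter(dim) for m in modalities})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        # Base output is shared; only the modality-specific adapter differs.
        return self.base(x) + self.adapters[modality](x)

layer = MixtureOfLoRAsLinear(dim=512)
audio_features = torch.randn(1, 10, 512)
out = layer(audio_features, modality="audio")  # same base weights, audio adapter
```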

The model has taken the top spot on the Hugging Face OpenASR Leaderboard for automatic speech recognition with a word error rate of only 6.14 percent. This puts it ahead of specialized models such as WhisperV3 and SeamlessM4T-v2-Large. According to Microsoft, the model also excels in speech translation and summarization.
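Word error rate is the number of word-level substitutions, insertions, and deletions divided by the length of the reference transcript, so 6.14 percent corresponds to roughly six errors per 100 reference words. A minimal sketch of the standard edit-distance computation:

```python
# Minimal word error rate (WER) sketch: edit distance over words divided
# by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.333
```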

Despite its relatively compact size, Phi-4-multimodal also shows good performance on vision-related tasks, especially when it comes to mathematical and scientific reasoning, Microsoft says. The company reports that it rivals larger models such as Gemini-2-Flash-lite-preview and Claude-3.5-Sonnet in document and graph comprehension, optical character recognition, and visual scientific reasoning.

[Image: Benchmark comparison of five models on AI2D, ChartQA, DocVQA, and InfoVQA.]
Microsoft's Phi-4-multimodal-instruct outperforms competing models in visual comprehension tasks. The benchmarks measure the ability to understand diagrams (AI2D), analyze charts (ChartQA), interpret documents (DocVQA), and extract information from visual content (InfoVQA), with Phi-4 as the clear leader in DocVQA (87.3 percent). | Image: Microsoft

Phi-4-mini: Focused on efficiency

The new Phi-4-mini comes in at 3.8 billion parameters and is built as a dense decoder-only transformer with a 128K context window. Microsoft has optimized it specifically for speed and efficiency.

One of Phi-4-mini's standout features is its function calling capability, which allows it to automatically use tools. The model can analyze queries, call relevant functions with appropriate parameters, receive results, and incorporate them into its responses using a standardized protocol. This makes it well suited for agent-based systems that need to connect to external tools, APIs, and data sources, Microsoft says.
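Microsoft does not publish the exact wire format in its announcement, but the typical loop looks like the hedged sketch below: the model emits a structured call, the host executes it, and the result is fed back so the model can compose the final answer. The tool name and JSON shape here are illustrative assumptions.

```python
# Illustrative function-calling loop (hypothetical names; the exact tool
# protocol Phi-4-mini expects may differ). The model emits a structured
# call, the host runs it, and the result goes back into the conversation.
import json

def get_stock_price(ticker: str) -> dict:
    """Hypothetical external tool the model can call."""
    return {"ticker": ticker, "price": 428.15}

TOOLS = {"get_stock_price": get_stock_price}

# 1. Given the tool schema, the model answers with a structured call
#    instead of plain text (the shape shown here is a common convention).
model_output = '{"name": "get_stock_price", "arguments": {"ticker": "MSFT"}}'

# 2. The host parses the call and dispatches to the real function ...
call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])

# 3. ... then feeds the result back so the model can write the final answer.
followup_message = {"role": "tool", "content": json.dumps(result)}
print(followup_message)
```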

[Image: Bar chart comparing ten language models across seven benchmarks, from MMLU-Pro to HumanEval.]
Microsoft's Phi-4-mini performs well for its size in various benchmark tests, especially GSM8K (89.8 percent) and Math. In some cases, the compact model outperforms much larger competitors such as Llama and Mistral. | Image: Microsoft

Real-world applications

Microsoft sees multiple applications for both models. Phi-4-multimodal could be integrated into smartphones to handle voice commands and image analysis, while also helping to improve driver assistance systems in cars. For Phi-4-mini, Microsoft sees particular value in financial services applications. It could support tasks such as calculations, report generation, and translation of financial documents.

Both models are available via Azure AI Foundry, Hugging Face and the NVIDIA API Catalog. These new additions join the Phi-4-14B language model, which Microsoft introduced in December 2024 and released with weights in January.
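On Hugging Face, loading should follow the usual transformers workflow. A minimal sketch, assuming the model is published under the repo ID microsoft/Phi-4-mini-instruct (check the model card for the exact ID and recommended generation settings):

```python
# Minimal sketch: loading Phi-4-mini from Hugging Face with transformers.
# The repo ID "microsoft/Phi-4-mini-instruct" is an assumption based on the
# hub listing; consult the model card for the exact ID and settings.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",
    device_map="auto",   # place weights on available GPU/CPU
    torch_dtype="auto",  # use the dtype the checkpoint recommends
)

messages = [{"role": "user", "content": "Summarize what a 128K context window means."}]
print(generator(messages, max_new_tokens=128)[0]["generated_text"])
```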

Summary
  • Microsoft is adding two new small language models to its Phi family: Phi-4-multimodal processes speech, images, and text simultaneously, while Phi-4-mini specializes in text-based tasks with 3.8 billion parameters.
  • Phi-4-multimodal uses a mixture-of-LoRAs approach to simultaneously process different types of input and, according to Microsoft, outperforms specialized models such as WhisperV3 in automatic speech recognition with a word error rate of 6.14 percent.
  • Phi-4-mini supports function calling for the use of external tools and can process sequences with up to 128,000 tokens. Both models are available via Azure AI Foundry, Hugging Face and the NVIDIA API Catalog.
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.