
Apple has released new performance data for its two in-house AI models and opened up the smaller system to third-party developers. The benchmarks show that Apple's LLM technology still trails the competition.


Apple developed two models: a compact 3-billion-parameter version for on-device use, and a larger server-based model. In Apple's own benchmarks, the 3B model edges out similarly sized models like Qwen-2.5-3B and comes close to Qwen-3-4B and Gemma-3-4B. Apple credits efficiency improvements for narrowing the gap, though with only about a billion parameters separating these models, the claim carries less weight than it might seem.

Diagrams: Human evaluation of Apple's AI text responses vs. others, by language group (English, English outside the US, eight others).
Human evaluations show Apple's on-device and server Foundation models lag well behind OpenAI's GPT-4o, one reason for Apple's ChatGPT partnership. | Image: Apple

The server-based model performs on par with Llama-4-Scout. While Apple hasn't disclosed the parameter count, it says the model is similar in size to Meta's Scout, which has 109 billion total parameters and 17 billion active ones.

This server model uses a "parallel track mixture-of-experts" (MoE) architecture, allowing several smaller AI systems to run in parallel. Even so, it can't compete with much larger models like Qwen-3-235B or GPT-4o.

Apple's parallel-track MoE architecture lets subnetworks process tokens independently, synchronizing only every four layers to reduce communication by 87.5%. | Image: Apple
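The figure only hints at how this works, so here is a loose, purely conceptual sketch in Swift of the pattern it describes, not Apple's implementation: all type and function names are invented for illustration. Several tracks apply their own layers to copies of the hidden state and only exchange information at block boundaries, rather than after every layer.

```swift
// Conceptual sketch of a parallel-track layout (illustrative only).
typealias Hidden = [Double]
typealias Layer = (Hidden) -> Hidden

struct Track {
    var layers: [Layer]
}

func forward(tracks: [Track], input: Hidden, syncEvery: Int) -> Hidden {
    var states = Array(repeating: input, count: tracks.count)
    let depth = tracks.first?.layers.count ?? 0

    // Merge track states by averaging them element-wise.
    func average(_ states: [Hidden]) -> Hidden {
        (0..<input.count).map { i in
            states.map { $0[i] }.reduce(0, +) / Double(states.count)
        }
    }

    for layerIndex in 0..<depth {
        // Tracks run independently; no cross-track communication here.
        for t in tracks.indices {
            states[t] = tracks[t].layers[layerIndex](states[t])
        }
        // Synchronize only at block boundaries (e.g. every four layers)
        // instead of after every layer, which is where the communication
        // savings come from.
        if (layerIndex + 1) % syncEvery == 0 {
            states = Array(repeating: average(states), count: tracks.count)
        }
    }
    return average(states)
}

// Example: four toy tracks of eight layers each, syncing every four layers.
let tracks = (0..<4).map { _ in
    Track(layers: (0..<8).map { _ -> Layer in { state in state.map { $0 * 1.01 } } })
}
let output = forward(tracks: tracks, input: [1.0, 2.0, 3.0], syncEvery: 4)
```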

Apple uses aggressive compression to run the device model efficiently on iPhones and iPads, while the server model employs a specialized graphics compression technique.

Image recognition: Efficient, but not the leader

For image recognition, Apple's device model competes with InternVL-2.5-4B, Qwen-2.5-VL-3B-Instruct, and Gemma-3-4B. According to Apple, it outperforms InternVL and Qwen, but only matches Gemma-3-4B. The server model beats Qwen-2.5-VL-32B in less than half of test cases and still trails both Llama-4-Scout and GPT-4o.

Bar charts: Human evaluation of Apple's AI image understanding (on-device/server) vs. other models (win/tie/lose).
Human evaluations compare Apple's AI models (device and server versions) for image understanding against competing systems. | Image: Apple

Apple uses different vision encoders for the two models: the server model relies on an encoder with roughly 1 billion parameters, while the device model uses a 300-million-parameter version. Both were trained on over ten billion image-text pairs and 175 million documents with embedded images.

Developers get the smaller model

Developers now have access to the 3-billion-parameter model via Apple's new Foundation Models Framework. Apple says this model works best for tasks like summarization, information extraction, and text understanding—not as an open-ended chatbot. The more powerful server model is reserved for Apple and powers Apple Intelligence features.

The framework offers free AI features and is integrated with Apple's Swift programming language. Developers can annotate their own data structures so the model generates output directly in that format (guided generation), and a tool-calling API lets them extend the model's abilities.
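As a rough idea of what this looks like in practice, here is a minimal sketch of guided generation with the framework, based on the interface names Apple has shown publicly (LanguageModelSession, the @Generable and @Guide macros); exact signatures may differ, and the struct and prompt below are invented for illustration.

```swift
import FoundationModels

// A developer-defined structure the model fills in directly
// ("tagging" a data structure for structured output).
@Generable
struct ArticleSummary {
    @Guide(description: "A one-sentence summary of the text")
    var headline: String

    @Guide(description: "Three to five key points")
    var keyPoints: [String]
}

func summarize(_ text: String) async throws -> ArticleSummary {
    // The session wraps Apple's on-device 3B foundation model.
    let session = LanguageModelSession(
        instructions: "You summarize articles concisely."
    )
    // Guided generation: the response is decoded into the tagged
    // data structure instead of returned as free-form text.
    let response = try await session.respond(
        to: "Summarize the following article: \(text)",
        generating: ArticleSummary.self
    )
    return response.content
}
```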


To improve multilingual performance, Apple expanded the models' vocabulary from 100,000 to 150,000 tokens. The company ran culture-specific tests in 15 languages to ensure appropriate responses across different regions. Training data comes from "hundreds of billions of pages" collected by Applebot, Apple's web crawler.

According to the company, Applebot respects robots.txt exclusions and does not use any user data for training. Whether a lack of opt-out should count as consent for AI training remains up for debate.

Apple's latest benchmarks confirm what was already suspected ahead of this year's WWDC: the company's AI models are still catching up to competitors like Google and OpenAI. The results make clear that Apple's systems can't match the technical performance of market leaders.

Summary
  • Apple has released performance data for its two AI models: the 3-billion-parameter device model matches the benchmark results of larger models like Qwen-3-4B and Gemma-3-4B, while the server-based model is comparable to Meta's Llama-4-Scout but falls short of larger systems such as GPT-4o.
  • In image understanding, the device model surpasses other compact systems like InternVL and Qwen and is close to Gemma-3-4B; the server model is more efficient than Qwen-2.5-VL-32B but does not reach the level of Llama-4-Scout or GPT-4o.
  • Developers can use the compact 3-billion-parameter model through Apple's new Foundation Models Framework, which provides free access and Swift integration, while the more capable server model is available only for Apple Intelligence features.