
New independent evaluations reveal that Meta's latest Llama 4 models - Maverick and Scout - perform well in standard tests but struggle with complex long-context tasks.


According to the aggregated "Intelligence Index" from Artificial Analysis, Meta's Llama 4 Maverick scored 49 points while Scout reached 36. This places Maverick ahead of Claude 3.7 Sonnet but behind Deepseek's V3 0324. Scout performs on par with GPT-4o-mini and outperforms both Claude 3.5 Sonnet and Mistral Small 3.1.

Both models demonstrated consistent capabilities across general reasoning, coding, and mathematical tasks, without showing significant weaknesses in any particular area.

Artificial Analysis's Intelligence Index ranks leading AI models across seven standardized evaluations. Deepseek leads at 53 points, with GPT-4o and Llama 4 Maverick following at 50 and 49 points respectively. | Image: Artificial Analysis

Maverick's architecture is comparatively efficient, using only half as many active parameters as Deepseek V3 (17 billion versus 37 billion) and about 60 percent of its total parameters (402 billion versus 671 billion). Unlike Deepseek V3, which only processes text, Maverick can also handle images.


Artificial Analysis reports median prices of $0.24/$0.77 per million input/output tokens for Maverick and $0.15/$0.40 for Scout. These rates undercut even the budget-friendly Deepseek V3 and come to as little as one-tenth of what OpenAI charges for GPT-4o.
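To put these rates in perspective, here is a minimal cost calculation in Python using the per-token prices quoted above; the workload of one million input tokens and 200,000 output tokens is a made-up example, not a figure from Artificial Analysis.

```python
# Per-million-token prices (USD) as quoted above from Artificial Analysis.
PRICES = {
    "llama-4-maverick": (0.24, 0.77),  # (input, output)
    "llama-4-scout": (0.15, 0.40),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD for a given number of input and output tokens."""
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Hypothetical workload: 1M input tokens, 200k output tokens.
print(request_cost("llama-4-maverick", 1_000_000, 200_000))  # ~$0.39
print(request_cost("llama-4-scout", 1_000_000, 200_000))     # ~$0.23
```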

Input and output prices per million tokens for 15 AI models show significant variation, with the new Llama models among the most affordable options. | Image: Artificial Analysis

Questions arise over LMArena results

The Llama 4 launch hasn't been without controversy. Multiple testers report that Maverick performs noticeably better on LMArena - a benchmark Meta heavily promotes - than on other platforms, even when using Meta's recommended system prompt.

Meta acknowledged using an "experimental chat version" of Maverick for this benchmark, suggesting possible optimization for human evaluators through detailed, well-structured responses with clear formatting.

In fact, when LMArena's "Style Control" is activated - a method that tries to isolate content quality from presentation by accounting for factors like response length and formatting - Llama 4 drops from second to fifth place. It's worth noting that other AI model developers likely employ similar benchmark optimization strategies.

Llama 4 Maverick ranks 2nd on the LMArena leaderboard without style control but drops to 5th place, with an arena score of 1307, when style control is enabled. | Image: Screenshot LMArena.ai

Long-context performance falls short

The most significant issues emerged in tests by Fiction.live, which evaluate complex long-text comprehension through multi-layered narratives.


Fiction.live argues their tests better reflect real-world use cases by measuring actual understanding rather than just search capabilities. Models must track temporal changes, make logical predictions based on established information, and distinguish between reader knowledge and character knowledge.

Llama 4's performance in these challenging tests was disappointing. Maverick showed no improvement over Llama 3.3 70B, while Scout's results were "downright atrocious."

The contrast is stark: at 120,000 tokens, Gemini 2.5 Pro maintains 90.6 percent accuracy, while Maverick achieves only 28.1 percent and Scout drops to 15.6 percent.

Fiction.live's long-context comprehension benchmark shows scores at increasing text lengths (up to 120k tokens): Gemini 2.5 Pro leads at 90.6 percent, while the Llama 4 models fall below 30 percent at the maximum length. | Image: Fiction.live

These results challenge Meta's claims about long-context capabilities. Scout, advertised as handling up to 10 million tokens, struggles with just 128,000. Maverick, which claims a one-million-token context window, also fails to process documents within 128,000 tokens consistently.


Research increasingly shows that large context windows offer fewer benefits than expected, as models struggle to evaluate all available information equally. Working with smaller contexts up to 128K often proves more effective, and users typically achieve even better results by breaking larger documents into chapters rather than processing them all at once.
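As a rough illustration of that chapter-by-chapter approach, the sketch below splits a long document into word-count-based chunks and summarizes each piece separately before combining the results. Both chunk_by_words and the query_model callable are hypothetical stand-ins, not part of any Llama 4 tooling.

```python
# Minimal sketch of chunked long-document processing (illustrative only).
def chunk_by_words(text: str, max_words: int = 3000) -> list[str]:
    """Split text into chunks of at most max_words words (a rough token proxy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_in_chunks(document: str, query_model) -> str:
    """Summarize each chunk separately, then merge the partial summaries.

    query_model is any callable that sends a prompt to a model and returns text.
    """
    partial = [query_model(f"Summarize this section:\n\n{chunk}")
               for chunk in chunk_by_words(document)]
    return query_model("Combine these section summaries into one coherent summary:\n\n"
                       + "\n\n".join(partial))
```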

Meta responds to mixed reception

In response to mixed performance reports, Meta's head of generative AI Ahmad Al-Dahle explains that early inconsistencies reflect temporary implementation challenges rather than limitations of the models themselves.

"Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in," Al-Dahle writes. He strongly denies allegations about test set training, stating "that's simply not true and we would never do that."

"Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations," Al-Dahle says, emphasizing that various services are still optimizing their Llama 4 deployments.

Summary
  • In independent tests, Meta's latest AI models, Llama 4 Maverick and Scout, have demonstrated strong performance on standard benchmarks, but struggled with more complex, long-context tasks.
  • Maverick achieved an accuracy of only 28.1 percent in a realistic long-context test, while Scout managed just 15.6 percent, highlighting their limitations in handling extended text contexts.
  • Despite these weaknesses, both models performed competitively in general categories such as reasoning, coding, and mathematics, with Maverick outperforming Claude 3.7 Sonnet in the Intelligence Index and Scout matching GPT-4o-mini while surpassing Claude 3.5 Sonnet.