Anthropic has announced upgrades to its Claude AI models, including an enhanced Claude 3.5 Sonnet and a new Claude 3.5 Haiku. The company is also introducing a new feature that allows the model to interact directly with computer interfaces.

The updated Claude 3.5 Sonnet shows significant improvements in programming tasks. Its score on the SWE-bench Verified test rose from 33.4% to 49.0%, which Anthropic says beats all publicly available models, including specialized programming systems.

Sonnet also made strides on TAU-bench, a test for agentic tool use. In the retail domain, its score rose from 62.6% to 69.2%, while in the more challenging aviation domain, it improved from 36.0% to 46.0%.
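
To put the reported scores in perspective, the short sketch below computes the absolute gain (in percentage points) and the relative gain for each benchmark. The figures come from the article; the function name `gains` and the dictionary layout are illustrative choices only.

```python
# Benchmark scores in percent as reported in the article: (before, after).
scores = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench (retail)": (62.6, 69.2),
    "TAU-bench (aviation)": (36.0, 46.0),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)


for name, (before, after) in scores.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

By this calculation, the SWE-bench Verified jump of 15.6 points is a relative improvement of roughly 47%, while the retail gain of 6.6 points is closer to 11%.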

Table: Comparison of AI models across various benchmarks. Claude 3.5 Sonnet (New) leads in several categories, including GPQA, MMLU, HumanEval, and AIME 2024.
The new Sonnet makes its biggest leaps in reasoning and agentic tool use tests. | Image: Anthropic

New Haiku model outperforms previous flagship

Anthropic is also introducing a new Claude 3.5 Haiku model. The company claims that this model outperforms the previous top-of-the-line Claude 3 Opus on many benchmarks, at a speed and cost similar to those of the previous Claude 3 Haiku. Notably, Anthropic did not mention any plans for a new Opus model in this announcement.

Comparison table: AI model performance in various benchmarks, Claude 3.5 Sonnet (new) leading in several categories.
The new Claude 3.5 Sonnet model shows improved performance, especially in logical reasoning, mathematical problem-solving and programming tasks. On the general language comprehension benchmark MMLU, it is only slightly ahead of the old Sonnet 3.5. | Image: Anthropic

The new Claude 3.5 Haiku demonstrates impressive capabilities relative to its speed and cost in programming tasks. It scores 40.6% on the SWE-bench Verified test, which Anthropic says exceeds the performance of many agents based on "publicly available state-of-the-art models," including GPT-4o.

Regarding knowledge cutoff dates, Claude 3.5 Sonnet is current through April 2024, while the new Haiku model has information up to July 2024. Anthropic plans to release Haiku later this month.

AI-driven computer interaction

Anthropic describes its new "computer use" feature as a significant innovation. Rather than developing specific tools for individual tasks, the company is taking a broader approach by teaching Claude general computer skills. This allows the AI to use various standard tools and software programs originally designed for human use.

Anthropic has developed an API that enables Claude to perceive and interact with computer interfaces. Developers can integrate this API to allow Claude to translate instructions like "Use data from my computer and the internet to fill out this form" into actual computer commands.

The system can move the mouse pointer, click on screen elements, and enter information using a virtual keyboard. In the OSWorld benchmark, which assesses AI models' ability to use computers in a human-like manner, Claude 3.5 Sonnet scored 14.9% in the "screenshots only" category. While this is significantly higher than the next best AI system at 7.8%, it still falls far short of human capabilities.
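
The pointer, click, and keyboard actions mentioned above can be pictured as a small dispatch loop. The sketch below is purely illustrative and does not use Anthropic's API: the action schema (`mouse_move`, `click`, `type`), the `FakeScreen` class, and the `apply_action` function are all hypothetical names invented for this example, and the "screen" only records what would have happened rather than controlling a real desktop.

```python
from dataclasses import dataclass, field


@dataclass
class FakeScreen:
    """Stand-in for a real desktop; it records actions instead of performing them."""
    pointer: tuple[int, int] = (0, 0)
    clicks: list[tuple[int, int]] = field(default_factory=list)
    typed: str = ""


def apply_action(screen: FakeScreen, action: dict) -> None:
    """Apply one model-proposed action to the screen (hypothetical schema)."""
    kind = action["type"]
    if kind == "mouse_move":
        screen.pointer = (action["x"], action["y"])
    elif kind == "click":
        # A click lands wherever the pointer currently is.
        screen.clicks.append(screen.pointer)
    elif kind == "type":
        screen.typed += action["text"]
    else:
        raise ValueError(f"unsupported action: {kind}")


# Hypothetical sequence a model might propose to fill in one form field.
actions = [
    {"type": "mouse_move", "x": 120, "y": 340},
    {"type": "click"},
    {"type": "type", "text": "Jane Doe"},
]

screen = FakeScreen()
for action in actions:
    apply_action(screen, action)

print(screen)
```

In a real integration, each step would presumably be followed by a fresh screenshot sent back to the model so it can decide on the next action; that feedback loop is omitted here.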

Anthropic recognizes that Claude's current computer interaction skills are imperfect. Some actions that humans find effortless, such as scrolling, dragging, or zooming, are still challenging for Claude. The company recommends that developers start with low-risk tasks when implementing this feature.

Summary
  • Anthropic presents improved versions of its AI models Claude 3.5 Sonnet and Claude 3.5 Haiku. Both models achieved significant performance improvements, particularly in programming tasks.
  • The new Claude 3.5 Haiku model is expected to outperform the previous top-of-the-line Claude 3 Opus model in many intelligence benchmarks, at the same cost and similar speed as its predecessor, Claude 3 Haiku.
  • Anthropic introduces a new feature for AI-driven computer use. An API enables Claude to perceive computer interfaces, interact with them, and translate instructions into concrete computer commands. However, the system is still a long way from human capabilities.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.