
Despite access to research tools and high processing costs, leading language models fell short on complex financial tasks.


A new benchmark from Vals.ai suggests that even the most advanced autonomous AI agents remain unreliable for financial analysis. The best-performing model, OpenAI's o3, managed just 48.3% accuracy—at an average cost of $3.69 per query.

The benchmark was developed in collaboration with a Stanford lab and a global systemically important bank. It consists of 537 tasks modeled on real-world responsibilities of financial analysts, including SEC document review, market research, and forecasting. In total, 22 leading foundation models were evaluated.

Table: Ranking of AI models in the financial benchmark by accuracy (%), cost per query ($), and latency (s).
The "Accuracy" score in the Vals.ai benchmark reflects the percentage of tasks each model completed correctly. The benchmark tests factual knowledge, research tool usage, and financial reasoning. | Source: Vals.ai

Basic tasks show promise, but financial reasoning remains out of reach

The models demonstrated limited success with basic assignments like extracting numerical data or summarizing text, where average accuracy ranged from 30% to 38%. However, they largely failed on more complex tasks. In the "Trends" category, ten models scored 0%, with the best result—28.6%—coming from Claude 3.7 Sonnet.


To complete these tasks, the benchmark environment gave agents access to tools such as EDGAR search, Google, and an HTML parser. Models like OpenAI's o3 and Claude 3.7 Sonnet (Thinking), which made more frequent use of these tools, generally performed better. In contrast, models like Llama 4 Maverick often skipped tool usage entirely, producing outputs without conducting any research—and showed correspondingly weak results.

Heavy tool use wasn't always a sign of better performance, however. GPT-4o Mini made the most tool calls of any model but still delivered low accuracy, owing to consistent errors in formatting and task sequencing.

In some cases, processing a single query cost over $5. OpenAI's o1 stood out as particularly inefficient, pairing low accuracy with high costs. In practical applications, these expenses would need to be weighed against the cost of human labor.

Scatter plot: Analysis of costs per query ($) vs. accuracy (%) of different AI models in the financial benchmark.
OpenAI's o3 model topped the benchmark with 48.3% accuracy, but also had the highest per-query cost at $3.69. Claude 3.7 Sonnet performed similarly—around 43% to 44% accuracy—for just $1 per query. OpenAI's o1 had the worst cost-performance ratio: $1.50 per query for about 20% accuracy. | Source: Vals.ai

Model performance varied widely. In one task focused on Netflix’s Q4 2024 share buybacks, Claude 3.7 Sonnet (Thinking) and Gemini 2.5 Pro returned accurate, source-backed answers. GPT-4o and Llama 3.3, on the other hand, either missed the relevant information or gave incorrect responses. These inconsistencies highlight the ongoing need for human oversight in areas like prompt engineering, system setup, and internal benchmarking.

A striking gap between investment and real-world readiness

Vals.ai concludes that today's AI agents are capable of handling simple but time-consuming tasks, yet remain unreliable for use in sensitive and highly regulated sectors like finance. The models still struggle with complex, context-heavy tasks and cannot currently serve as the sole basis for decision-making.


While the models can extract basic data from documents, they fall short when deeper financial reasoning is required—making them ill-suited to fully replace human analysts.

"The data reveals a striking gap between investment & readiness. Today's agents can fetch numbers but stumble on the crucial financial reasoning needed to truly augment analyst work and unlock value in this space," the company writes.

The benchmark framework is open source and available on GitHub, though the test dataset remains private to prevent targeted training. A full breakdown of the results is available on the Vals.ai website.

Summary
  • A new benchmark from Vals.ai finds that AI agents are not yet suitable for complex financial analyses, with OpenAI's o3 model achieving only 48.3 percent accuracy at an average cost of $3.69 per query.
  • While models performed better on simple tasks like data extraction, with up to 38 percent accuracy, they mostly failed at more challenging analyses or forecasts. Using external tools such as search engines improved outcomes but did not ensure correct answers.
  • The testers conclude that AI agents can assist with straightforward, repetitive tasks, but they lack the deep financial understanding needed for dependable use in finance.