Language models are a poor choice for answering financial questions, study says

Are large language models fit for the financial world? A new study shows that even the best LLMs struggle to correctly answer questions about financial data.

Researchers at Patronus AI have found that large language models, such as OpenAI's GPT-4 Turbo or Claude 2, often fail to answer questions about SEC filings. This underscores the challenges that regulated industries such as finance face when implementing AI models in customer service or research processes.

Poor accuracy across all models tested

Patronus AI tested four language models for answering questions about corporate financial reports: OpenAI's GPT-4 and GPT-4 Turbo, Anthropics Claude 2, and Metas Llama 2. The best model, GPT-4 Turbo, achieved only 79 percent accuracy in Patronus AI's test, even though almost the entire report was included in the prompt.

The models often refused to answer questions or made up facts and figures that were not included in the SEC filings. According to Anand Kannappan, co-founder of Patronus AI, this performance is unacceptable and should be much higher for use in automated and production-ready applications.

FinanceBench tests 10,000 financial questions

Patronus AI developed FinanceBench for the test, a dataset of more than 10,000 questions and answers from SEC filings of major public companies. The dataset includes the correct answers and their exact location in the reports. Some answers require simple mathematical or logical reasoning.

Has CVS Health paid dividends to common shareholders in Q2 of FY2022?

Did AMD report customer concentration in FY22?

What is Coca Cola’s FY2021 COGS % margin? Calculate what was asked by utilizing the line items clearly shown in the income statement.

Questions from FinanceBench

The researchers believe that AI models have significant potential to assist the financial sector if they continue to improve. However, current performance levels show that humans still need to be involved in the workflow to support and manage the process.

One possible solution could be to make AI systems more familiar with the specific task through improved prompting. This could increase their ability to extract relevant information. However, the question remains whether such approaches can solve the problem in general or only in certain scenarios.

In its usage guidelines, OpenAI rules out using an OpenAI model to provide individual financial advice without a qualified person reviewing the information.

Known problem with no new solution

Language models with large context windows are known to have difficulty extracting information reliably, especially from the middle of long texts. This phenomenon, known as "lost in the middle", raises questions about the usefulness of large context windows for LLMs.

Recommendation

AI in practice

Update

OpenAI has a "highly accurate" ChatGPT text detector, but won't release it for now

Anthropic has recently developed a method to solve the "lost in the middle" problem of its Claude 2.1 AI model by prefacing the model's answer with the sentence "This is the most relevant sentence in the context:". Tests need to show whether this method scales reliably to many tasks and provides a similar improvement for other LLMs such as GPT-4 (Turbo).

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Language models are a poor choice for answering financial questions, study says

Poor accuracy across all models tested

FinanceBench tests 10,000 financial questions

Known problem with no new solution

OpenAI has a "highly accurate" ChatGPT text detector, but won't release it for now

Google DeepMind open-sources AI text watermarking for Gemini

Microsoft's RUBICON tells if your AI coding buddy is actually helping or just slacking off

Language models like GPT-4 memorize more than they reason, study finds

OpenAI launches new ChatGPT agent that automates complex tasks for Pro, Plus, and Team

Kimi-K2 is the next open-weight AI milestone from China after Deepseek

New Energy-Based Transformer architecture aims to bring better "System 2 thinking" to AI models

Language models are a poor choice for answering financial questions, study says

Poor accuracy across all models tested

FinanceBench tests 10,000 financial questions

Known problem with no new solution

Share

Bank details