
A new study shows that today's reasoning models can pass the grueling Chartered Financial Analyst (CFA) exams. Gemini 3.0 Pro set a record with a score of 97.6 percent at Level I.


The Chartered Financial Analyst (CFA) certification is widely considered one of finance's toughest qualifications. The three-stage exam tests progressively complex skills, ranging from fundamental knowledge to application, analysis, and complex portfolio construction.

In 2023, the leading language models of the time could already answer some CFA exam questions, but performance was mixed: ChatGPT (GPT-3.5) failed Levels I and II, and GPT-4 passed Level I but failed Level II. GPT-4o, operating as a pure language model, eventually passed all three levels.

A new study from researchers at Columbia University, Rensselaer Polytechnic Institute, and the University of North Carolina shows that the current generation of reasoning models passes all three levels, sometimes with near-perfect scores.


Researchers put six reasoning models through 980 exam questions: three Level I exams with 540 multiple-choice questions, two Level II exams with 176 case-based questions, and three Level III exams with 264 questions, including open-answer formats. The result: Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1 passed every level based on established criteria.

Gemini and GPT-5 lead the pack

Gemini 3.0 Pro hit a record 97.6 percent on Level I, the foundational test consisting of independent multiple-choice questions. GPT-5 followed at 96.1 percent, with Gemini 2.5 Pro at 95.7 percent. Even the weakest model tested, DeepSeek-V3.1, scored 90.9 percent.

GPT-5 took the lead on Level II, which tests application and analysis through case studies, scoring 94.3 percent. Gemini 3.0 Pro reached 93.2 percent and Gemini 2.5 Pro 92.6 percent. The researchers noted that models achieved "nearly perfect results" here. Ethics, however, proved to be a stumbling block: the researchers reported relative error rates of 17 to 21 percent at Level II, even for the top-performing models.

On Level III—the most complex stage combining multiple-choice with open responses—Gemini 2.5 Pro performed best on multiple-choice questions at 86.4 percent. However, Gemini 3.0 Pro dominated the constructed responses with 92.0 percent, a significant jump from its predecessor's 82.8 percent.

Level                              Best model       Result
Level I (multiple choice)          Gemini 3.0 Pro   97.6%
Level II (multiple choice)         GPT-5            94.3%
Level III (multiple choice)        Gemini 2.5 Pro   86.4%
Level III (constructed responses)  Gemini 3.0 Pro   92.0%
Overall ranking                    Gemini 3.0 Pro   1st place

The study used mock CFA exams: official CFA Institute Practice Pack material for Levels I and II, and third-party AnalystPrep mock exams for Level III, the latter chosen to maintain comparability with previous research.


An o4-mini model automated the grading of open answers. The study notes this introduces measurement errors and a possible "verbosity bias" where detailed answers get higher scores. Consequently, the results serve as model-based approximations.
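The article does not reproduce the study's grading setup, but automated grading of this kind typically means prompting the judge model with the question, a reference solution, and the candidate answer. The sketch below, assuming the OpenAI Python SDK and an invented rubric and function name, shows the general pattern; it is an illustration, not the study's actual pipeline.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative rubric; the study's actual grading prompt is not public here.
RUBRIC = (
    "You are grading a CFA Level III constructed-response answer. "
    "Score it from 0 to 100 against the reference solution. Judge substance, "
    "not length (to limit verbosity bias). Reply with the integer score only."
)

def grade_open_answer(question: str, reference: str, answer: str) -> int:
    """Ask the judge model (here o4-mini) for a 0-100 score."""
    completion = client.chat.completions.create(
        model="o4-mini",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"Question:\n{question}\n\n"
                f"Reference solution:\n{reference}\n\n"
                f"Candidate answer:\n{answer}"
            )},
        ],
    )
    return int(completion.choices[0].message.content.strip())
```

Even with an anti-verbosity instruction like the one above, the study's point stands: a second model's judgment is itself a source of measurement error.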

Pass thresholds were drawn from previous work: Level I requires at least 60 percent per topic and 70 percent overall. Level II needs at least 50 percent per topic and 60 percent overall. Level III requires an average of at least 63 percent across multiple-choice and constructed-response sections.
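To make these rules concrete, here is a minimal Python sketch of the pass check; the function and parameter names are illustrative, not taken from the study.

```python
def passes_level(level: str, overall: float,
                 topic_scores: tuple[float, ...] = (),
                 constructed: float | None = None) -> bool:
    """Pass rules as described in the study (all scores in percent).

    Level I:   every topic >= 60 and overall >= 70.
    Level II:  every topic >= 50 and overall >= 60.
    Level III: mean of the multiple-choice ('overall') and
               constructed-response scores >= 63.
    """
    if level == "I":
        return min(topic_scores) >= 60 and overall >= 70
    if level == "II":
        return min(topic_scores) >= 50 and overall >= 60
    if level == "III":
        if constructed is None:
            raise ValueError("Level III needs a constructed-response score")
        return (overall + constructed) / 2 >= 63
    raise ValueError(f"unknown level: {level!r}")

# Example with figures from the article: Gemini 2.5 Pro at Level III scored
# 86.4 percent on multiple choice and 82.8 percent on constructed responses,
# averaging 84.6 -- comfortably above the 63 percent threshold.
print(passes_level("III", 86.4, constructed=82.8))  # True
```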

Passing a test doesn't mean doing the job

The researchers say the results suggest "reasoning models surpass the expertise required of entry-level to mid-level financial analysts and may achieve senior-level financial analyst proficiency in the future." While LLMs had already mastered the "codified knowledge" of Levels I and II, the latest generation is now developing the complex synthesis skills required for Level III.

The usual caveats apply. Benchmarks—especially multiple-choice formats—only hint at performance and potential economic impact. Passing a test doesn't mean a model can handle the daily grind of a financial analyst, which involves client meetings, assessing market sentiment, and making decisions with incomplete information.


The study also notes that models still struggle most with ethical questions, which often require contextual understanding and judgment. Exams test isolated knowledge, not the ability to apply it in complex, changing real-world situations.

The researchers also can't rule out data contamination. Although they used current, paid materials, questions might have leaked into training data through paraphrased content in public datasets. This means there is a chance the models simply knew the answers rather than reasoning through them.

Still, the leap from "failed" to "almost perfect" in just two years highlights the rapid advance of AI in specialized domains. For the financial sector, the question, it seems, is no longer whether AI can master the material, but how to integrate that knowledge into actual workflows.

Summary
  • Six reasoning models were tested on the CFA exam, a notoriously difficult certification for finance professionals, and all passed all three levels.
  • Gemini 3.0 Pro led the pack on Level I with a score of 97.6 percent, while GPT-5 came out on top for Level II, scoring 94.3 percent.
  • Despite their strong overall performance, the models consistently stumbled on ethics questions, with error rates between 17 and 21 percent, and researchers note it's unclear whether some exam material appeared in the models' training data.