
Researchers have put leading AI models through a new kind of test—one that measures how well they can reason their way to a courtroom victory. The results highlight some clear differences in both performance and cost.


A team from the Hao AI Lab at the University of California San Diego evaluated current language models using "Phoenix Wright: Ace Attorney," a game that requires players to collect evidence, spot contradictions, and expose the truth behind lies.

According to Hao AI Lab, these mechanics make the game particularly well suited to the test: the models had to sift through long conversations, spot inconsistencies during cross-examination, and select the appropriate evidence to challenge witness statements.

The experiment was partly inspired by OpenAI co-founder Ilya Sutskever, who once compared next-word prediction to understanding a detective story. Sutskever recently secured additional multi-billion-euro funding for a new AI venture.


o1 leads, Gemini follows

The researchers tested several top multimodal and reasoning models, including OpenAI o1, Gemini 2.5 Pro, Claude 3.7-thinking, and Llama 4 Maverick. Both o1 and Gemini 2.5 Pro advanced to level 4, but o1 came out ahead on the toughest cases.

[Chart: Ace Attorney performance test across eight AI language models, scores from 0 to 26. With scores of 26 and 20, o1-2024-12-17 and Gemini 2.5 Pro achieved the highest results. | Image: Hao AI Lab]

The test goes beyond simple text or image analysis. As the team explains, models have to search through long contexts and spot the contradictions in them, interpret visual information precisely, and make strategic decisions as the game unfolds.

"Game design pushes AI beyond pure textual and visual tasks by requiring it to convert understanding into context-aware actions. It is harder to overfit because success here demands reasoning over context-aware action space - not just memorization," the researchers explain.

Overfitting occurs when a language model memorizes its training data, noise and errors included, and then performs poorly on new, unfamiliar examples. The issue also shows up in reasoning models optimized for math and code tasks: they may become more efficient at finding correct solutions, but at the cost of considering a narrower range of solution paths.

Gemini 2.5 Pro offers better price-performance

Gemini 2.5 Pro turned out to be significantly more cost-efficient than the other models tested. Hao AI Lab reports that it is six to fifteen times cheaper than o1, depending on the case. In one particularly lengthy Level 2 scenario, o1 incurred costs exceeding $45.75, while Gemini 2.5 Pro completed the task for $7.89.


Gemini 2.5 Pro is also cheaper than GPT-4.1, a model not specifically optimized for reasoning, at $1.25 per million input tokens compared to $2 for GPT-4.1. The researchers note, however, that actual costs could be slightly higher due to image processing requirements.
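To put these numbers in perspective, here is a rough back-of-the-envelope sketch in Python. The per-case totals ($45.75 for o1, $7.89 for Gemini 2.5 Pro) and the input-token prices ($1.25 and $2 per million tokens) are the figures reported above; the prompt size used to turn a price into a per-case estimate is an invented illustrative value, and a real bill would also include output tokens and image processing.

    # Back-of-the-envelope cost comparison based on the figures reported above.
    # Per-case totals and per-million-token input prices come from the article;
    # the prompt size below is a purely illustrative assumption.

    # Reported totals for the lengthy Level 2 scenario (USD)
    o1_case_cost = 45.75
    gemini_case_cost = 7.89
    ratio = o1_case_cost / gemini_case_cost
    print(f"o1 vs. Gemini 2.5 Pro on this case: {ratio:.1f}x more expensive")
    # -> roughly 5.8x, consistent with the low end of the reported 6-15x range

    # Input-token pricing (USD per million tokens)
    prices = {"Gemini 2.5 Pro": 1.25, "GPT-4.1": 2.00}

    # Hypothetical prompt size for a long courtroom transcript (assumption)
    input_tokens = 500_000

    for name, price in prices.items():
        cost = input_tokens / 1_000_000 * price
        print(f"{name}: ~${cost:.2f} for {input_tokens:,} input tokens "
              f"(excluding output and image tokens)")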

[Chart: Radar comparison of AI model performance (scale 0-100) across games including 2048, Sokoban, Super Mario, Ace Attorney, and Tetris. In the Game Arena benchmark, Hao AI Lab has already compared current language models on games such as 2048, Tetris, Sokoban, and Candy Crush. | Image: Hao AI Lab]

Since February, the team has been benchmarking language models on a range of games, including Candy Crush, 2048, Sokoban, Tetris, and Super Mario. Of all the titles tested so far, Ace Attorney is likely the game with the most demanding mechanics when it comes to reasoning.

Summary
  • Researchers at the Hao AI Lab at the University of California San Diego evaluated AI models, including OpenAI o1 and Gemini 2.5 Pro, by having them play "Phoenix Wright: Ace Attorney," a game that involves spotting contradictions and presenting appropriate evidence.
  • Both models successfully handled the most challenging stages, but the o1 model was slightly more capable overall. Gemini 2.5 Pro, however, was much more cost-effective, completing a lengthy case for about $8 compared to over $45 for o1.
  • The researchers highlight that the game is a strong test for AI systems because it demands not only reading and image analysis, but also strategy and making logical connections.