Teams competing in the 2024 ARC Prize made significant progress on the challenging AI benchmark, with the best performance jumping from 33 to 55.5 percent on the private evaluation set. However, the target score of 85 percent remains well out of reach.
The AI benchmark ARC-AGI, considered a key measure for artificial general intelligence (AGI), experienced a significant performance leap in 2024. According to the technical report for the ARC Prize 2024, the best performance on the private evaluation set increased from 33 to 55.5 percent.
"ARC-AGI remains undefeated – still by a considerable margin," the report's authors note. They point out that even in 2020, simple brute-force program search could theoretically achieve 49 percent.
The $600,000 main prize for reaching 85 percent went unclaimed. While team MindsAI achieved the highest score of 55.5 percent, they did not release their code and were therefore ineligible for the prize.
Three new approaches dominate the competition
According to the report, three main approaches emerged: AI-assisted program synthesis, test-time training (TTT), and combinations of both methods.
In program synthesis, systems use large language models to generate program code or guide program search. Ryan Greenblatt achieved 42 percent by having GPT-4o generate and debug thousands of Python programs per task.
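The general pattern behind this approach can be sketched briefly. The following is a minimal illustration, not Greenblatt's actual pipeline: llm_generate stands in for a call to a code model such as GPT-4o, and the task format follows the public ARC JSON layout, with grids as lists of lists of integers.

```python
# Minimal sketch of LLM-guided program synthesis for one ARC task.
# Assumption: `llm_generate(prompt) -> str` is a placeholder for a call
# to a code-generating model; it is not a real API from the ARC Prize report.

def solve_by_program_search(task, llm_generate, num_candidates=1000):
    train_pairs = task["train"]
    test_input = task["test"][0]["input"]

    prompt = (
        "Write a Python function transform(grid) that maps each input grid "
        f"to its output grid for these examples:\n{train_pairs}"
    )

    for _ in range(num_candidates):
        code = llm_generate(prompt)      # sample one candidate program
        namespace = {}
        try:
            exec(code, namespace)        # define transform() in a scratch namespace
            transform = namespace["transform"]
            # keep the program only if it reproduces every demonstration output
            if all(transform(p["input"]) == p["output"] for p in train_pairs):
                return transform(test_input)  # apply the verified program to the test input
        except Exception:
            continue                     # discard programs that crash or fail to parse
    return None                          # no candidate passed verification
```

In practice, entries along these lines also ask the model to revise failing programs based on their error output, which is where most of the "debugging" happens.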
Test-time training adapts a pre-trained language model to the specific task at runtime. This approach, introduced by MindsAI, was adopted by many teams. The winning team "the ARChitects" achieved 53.5 percent using this method.
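Conceptually, test-time training turns a task's few demonstration pairs into a tiny, task-specific training set, usually expanded with augmentations such as rotations, reflections, and color permutations, and fine-tunes a copy of the model on it before predicting. The sketch below illustrates that idea under assumptions of its own: base_model, encode, loss_fn, and augment are placeholders, not code released by MindsAI or the ARChitects.

```python
import copy
import torch

# Minimal sketch of test-time training (TTT) for a single ARC task.
# Assumptions: `base_model` maps an encoded input grid to an output grid;
# `encode`, `loss_fn`, and `augment` (e.g. rotations, reflections,
# color permutations of the demonstration pairs) are hypothetical helpers.

def test_time_train(base_model, task, encode, loss_fn, augment, steps=100, lr=1e-4):
    model = copy.deepcopy(base_model)        # adapt a copy, never the shared base weights
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Build a tiny, task-specific training set from the demonstration pairs.
    pairs = [aug for p in task["train"] for aug in augment(p)]

    model.train()
    for _ in range(steps):
        for pair in pairs:
            optimizer.zero_grad()
            pred = model(encode(pair["input"]))
            loss = loss_fn(pred, encode(pair["output"]))
            loss.backward()
            optimizer.step()

    # The adapted copy is used only for this task's test inputs, then discarded.
    model.eval()
    with torch.no_grad():
        return [model(encode(t["input"])) for t in task["test"]]
```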
Teams combining both approaches achieved the best results; pure program synthesis and pure test-time training each reach only about 40 percent. A Massachusetts Institute of Technology team recently demonstrated in a paper how a language model combining both approaches reached 61.9 percent accuracy on the ARC Prize's public evaluation set. Because the approach exceeds the competition's compute limits, it has not yet been tested on the private leaderboard, which contains 100 unpublished tasks. Results for OpenAI's full o1 model, including Pro mode, are also pending, though a dramatic performance increase isn't expected.
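One simple way to picture the combination, reusing the hypothetical helpers sketched above: try verified program search first and fall back to the task-adapted model when no program reproduces the demonstrations. Real entries typically generate many candidate answers from both routes and pick one by voting, so this is a simplification.

```python
# Minimal sketch of combining program synthesis and test-time training
# for one task. `solve_by_program_search` and `test_time_train` are the
# hypothetical helpers from the sketches above, not code from any team.

def solve_combined(task, llm_generate, base_model, encode, loss_fn, augment):
    # 1. Prefer a program that exactly reproduces all demonstration pairs.
    answer = solve_by_program_search(task, llm_generate)
    if answer is not None:
        return answer
    # 2. Otherwise fall back to the prediction of the task-adapted model.
    return test_time_train(base_model, task, encode, loss_fn, augment)[0]
```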
New benchmark planned for 2025
The ARC Prize organizers also announced plans to develop a new benchmark called ARC-AGI-2 for 2025. The current dataset, which dates from 2019, has several weaknesses: the private evaluation set contains only 100 tasks, which is too small, and the thousands of evaluations run against it create a risk of overfitting.
"We aspire to grow ARC Prize from its experimental origins into a durable north star for AGI," the report states. The competition has already prompted at least seven well-funded AI startups to change their roadmaps and focus on solving the benchmark, the team writes. Larger companies have also begun internal efforts to crack the benchmark.
The ARC Prize will continue annually until the benchmark is solved and a public reference solution is available. The organizers believe that the team that eventually develops AGI is already working on ARC-AGI today.
New ideas needed
The ARC-AGI benchmark is significant, according to the report's authors, because it tests the ability to solve completely novel tasks - as opposed to tasks a system can prepare for. Classical deep learning approaches based on retrieving stored patterns fail at this.
The progress in 2024 shows that algorithmic improvements can have a major impact, and massive computing power isn't necessarily required. However: "New ideas are still needed to build AGI. The fact that ARC-AGI survived five months of intense scrutiny with an outstanding $600,000 grand prize and hundreds of thousands of dollars in additional prizes is strong evidence that the solution does not yet exist."