Researchers have achieved a new SOTA on the MATH benchmark with GPT-4 Code Interpreter using new prompting methods.

The team tested OpenAI's latest AI system, GPT-4 Code Interpreter (GPT4-Code), on mathematical reasoning datasets such as MATH, which the researchers describe as the most challenging mathematical problem set.

They found that GPT4-Code significantly outperformed GPT-4, achieving 69.7% accuracy on MATH compared to GPT-4's 42.2%, and also significantly surpassed the previous state-of-the-art performance of 53.9%.

Image: Zhou, Wang, Lu, Shi, Luo et al.

Using different prompts, the team constrained how frequently GPT4-Code was allowed to generate and run code, and found "that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs."

Two methods greatly enhance the mathematical capabilities of the GPT-4 Code Interpreter

Based on these findings, the researchers aimed to further improve GPT4-Code's mathematical capabilities by pushing for more frequent code execution, since this improves performance, especially on harder problems.

They proposed two methods:

  • Explicit Code-Based Self-Verification
    • This prompts GPT4-Code to verify its answer with code. If the verification fails, the model keeps revising its solution until the check succeeds (see the first sketch below).
  • Verification-Guided Weighted Majority Voting
    • This incorporates the verification results into majority voting: answers verified as true are given higher weights, reflecting greater confidence (see the second sketch below).
Image: Zhou, Wang, Lu, Shi, Luo et al.
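The paper's exact prompts are not reproduced here, but the self-verification loop can be sketched in a few lines. The following Python sketch is an illustration, not the authors' implementation: `ask_model` is a hypothetical stand-in for a GPT-4 Code Interpreter call, and the retry budget is an assumption.

```python
def ask_model(prompt: str) -> tuple[str, bool]:
    """Hypothetical stand-in for a GPT-4 Code Interpreter call (replace with
    a real API client). Returns (answer, verified), where `verified` reflects
    the model's own code-based check of its answer."""
    raise NotImplementedError

MAX_ATTEMPTS = 3  # assumed retry budget; the article does not specify one

def solve_with_self_verification(problem: str) -> str:
    prompt = (
        f"Solve this problem step by step, writing and executing code: {problem}\n"
        "Then write and run code that verifies your answer. If the verification "
        "fails, revise your solution and verify again."
    )
    answer, verified = ask_model(prompt)
    for _ in range(MAX_ATTEMPTS - 1):
        if verified:
            break
        # Feed the failed check back so the model rectifies its solution.
        answer, verified = ask_model(
            prompt + f"\nYour previous answer {answer} failed verification. Try again."
        )
    return answer
```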

Experiments showed that these methods improved accuracy on MATH to 84.3%, up from GPT4-Code's 69.7%. The key was exploiting the model's self-verification capabilities through increased code usage.
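Mechanically, the voting step can be sketched as follows: sample several solutions per problem, then let each answer's vote count with a weight that depends on whether its self-verification succeeded. This is a minimal sketch; the verification states and the concrete weight values are illustrative assumptions, not figures from the paper.

```python
from collections import defaultdict

# Illustrative weights: verified answers count more than unverified ones.
WEIGHTS = {"true": 1.0, "uncertain": 0.5, "false": 0.2}  # assumed values

def verification_guided_vote(samples: list[tuple[str, str]]) -> str:
    """samples: (answer, verification_state) pairs from repeated runs,
    where verification_state is 'true', 'uncertain', or 'false'."""
    scores: dict[str, float] = defaultdict(float)
    for answer, state in samples:
        scores[answer] += WEIGHTS[state]
    return max(scores, key=scores.get)  # answer with the highest weighted vote

# Example: "42" wins because two of its three votes passed verification.
print(verification_guided_vote([
    ("42", "true"), ("42", "true"), ("17", "false"),
    ("17", "uncertain"), ("42", "false"),
]))
```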

Methods could produce more accurate datasets for open-source models such as LLaMA 2

The researchers also evaluated their technique on the MMLU benchmark for math and science problems. Again, it improved the accuracy of the GPT-4 Code Interpreter across all datasets, demonstrating broad applicability.

The team now wants to apply its findings on the important role of code usage frequency, along with its two methods, to LLMs beyond GPT-4. It also plans to use them to create more accurate datasets that would include "detailed step-by-step code-based solution generation and code-based validation, which could help improve open-source LLMs like LLaMA 2".

Summary
  • Researchers tested OpenAI's GPT-4 Code Interpreter on the challenging MATH benchmark and achieved a new state-of-the-art accuracy of 69.7%, far surpassing GPT-4's 42.2%.
  • They enhanced the AI system's mathematical capabilities even more by using two methods: Explicit Code-Based Self-Verification and Verification-Guided Weighted Majority Voting, raising the MATH accuracy to a new SOTA of 84.3%.
  • Experiments show the methods have broad applicability, and the team plans to apply their findings to other large language models and create more accurate datasets for open-source AI systems.