Researchers have achieved a new SOTA on the MATH benchmark with GPT-4 Code Interpreter using new prompting methods.

The team tested OpenAI's latest AI system, GPT-4 Code Interpreter (GPT4-Code), on mathematical reasoning datasets such as MATH, which the researchers describe as the most challenging mathematical problem set.

They found that GPT4-Code significantly outperformed GPT-4, achieving 69.7% accuracy on MATH compared to GPT-4's 42.2%, and also significantly surpassed the previous state-of-the-art performance of 53.9%.

Image: Zhou, Wang, Lu, Shi, Luo et al.

Using different prompts, the team constrained how frequently GPT4-Code was allowed to generate and run code, and found "that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs."

Two methods greatly enhance the mathematical capabilities of the GPT-4 Code Interpreter

Based on these findings, the researchers aimed to further improve GPT4-Code's mathematical capabilities by pushing for more frequent code execution, since this improves performance, especially on harder problems.

They proposed two methods:

  • Explicit Code-Based Self-Verification
    • This prompts GPT4-Code to verify its answer with code. If the verification fails, the model keeps revising its solution until the check succeeds (see the first sketch below).
  • Verification-Guided Weighted Majority Voting
    • This incorporates the verification results into majority voting: answers verified as true are given higher weights, reflecting greater confidence (see the second sketch below).
Image: Zhou, Wang, Lu, Shi, Luo et al.
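The paper's exact prompts are not reproduced here, but the self-verification loop can be sketched in a few lines. The following Python sketch is an illustration, not the authors' implementation: `ask_model` is a hypothetical stand-in for a GPT-4 Code Interpreter call, and the retry budget is an assumption.

```python
def ask_model(prompt: str) -> tuple[str, bool]:
    """Hypothetical stand-in for a GPT-4 Code Interpreter call (replace with
    a real API client). Returns (answer, verified), where `verified` reflects
    the model's own code-based check of its answer."""
    raise NotImplementedError

MAX_ATTEMPTS = 3  # assumed retry budget; the article does not specify one

def solve_with_self_verification(problem: str) -> str:
    prompt = (
        f"Solve this problem step by step, writing and executing code: {problem}\n"
        "Then write and run code that verifies your answer. If the verification "
        "fails, revise your solution and verify again."
    )
    answer, verified = ask_model(prompt)
    for _ in range(MAX_ATTEMPTS - 1):
        if verified:
            break
        # Feed the failed check back so the model rectifies its solution.
        answer, verified = ask_model(
            prompt + f"\nYour previous answer {answer} failed verification. Try again."
        )
    return answer
```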

Experiments showed that these methods improved accuracy on MATH to 84.3%, up from GPT4-Code's 69.7%. The key was exploiting the model's self-verification capabilities through increased code usage.
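Mechanically, the voting step can be sketched as follows: sample several solutions per problem, then let each answer's vote count with a weight that depends on whether its self-verification succeeded. This is a minimal sketch; the verification states and the concrete weight values are illustrative assumptions, not figures from the paper.

```python
from collections import defaultdict

# Illustrative weights: verified answers count more than unverified ones.
WEIGHTS = {"true": 1.0, "uncertain": 0.5, "false": 0.2}  # assumed values

def verification_guided_vote(samples: list[tuple[str, str]]) -> str:
    """samples: (answer, verification_state) pairs from repeated runs,
    where verification_state is 'true', 'uncertain', or 'false'."""
    scores: dict[str, float] = defaultdict(float)
    for answer, state in samples:
        scores[answer] += WEIGHTS[state]
    return max(scores, key=scores.get)  # answer with the highest weighted vote

# Example: "42" wins because two of its three votes passed verification.
print(verification_guided_vote([
    ("42", "true"), ("42", "true"), ("17", "false"),
    ("17", "uncertain"), ("42", "false"),
]))
```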

Methods could produce more accurate datasets for open-source models such as LLaMA 2

The researchers also evaluated their technique on the MMLU benchmark for math and science problems. Again, it improved the accuracy of the GPT-4 Code Interpreter across all datasets, demonstrating broad applicability.

The team now wants to apply its findings on the important role of code usage frequency, along with its two methods, to LLMs beyond GPT-4. It also plans to use them to create more accurate datasets that would include "detailed step-by-step code-based solution generation and code-based validation, which could help improve open-source LLMs like LLaMA 2".

Summary
  • Researchers tested OpenAI's GPT-4 Code Interpreter on the challenging MATH benchmark and achieved a new state-of-the-art accuracy of 69.7%, far surpassing GPT-4's 42.2%.
  • They enhanced the AI system's mathematical capabilities even more by using two methods: Explicit Code-Based Self-Verification and Verification-Guided Weighted Majority Voting, raising the MATH accuracy to a new SOTA of 84.3%.
  • Experiments show the methods have broad applicability, and the team plans to apply their findings to other large language models and create more accurate datasets for open-source AI systems.