
A new study comparing public and private datasets shows how OpenAI's o1 model approaches mathematical problem-solving. The research aims to determine whether the AI uses actual logical reasoning or simply recalls memorized solutions.


According to the "OpenAI-o1 AB Testing" study, researchers found no significant evidence that the o1-mini model relies mainly on memorization when solving math problems. The team compared the model's performance on publicly available Mathematical Olympiad problems against similar but less well-known training problems used by the Chinese national team.
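As a rough illustration of what "no significant evidence" can mean in such an A/B comparison, the following sketch runs a two-proportion z-test on the accuracy of two problem sets. This is an assumed, generic statistical check, not necessarily the procedure used in the study, and the counts in the usage example are hypothetical placeholders rather than the study's data.

```python
import math


def two_proportion_z_test(correct_a, total_a, correct_b, total_b):
    """Return (z, two-sided p-value) for the difference between two accuracies."""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    # Pooled accuracy under the null hypothesis that both sets are equally hard.
    pooled = (correct_a + correct_b) / (total_a + total_b)
    variance = pooled * (1 - pooled) * (1 / total_a + 1 / total_b)
    if variance == 0:
        raise ValueError("test is undefined when pooled accuracy is 0 or 1")
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided p-value of a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for illustration only (NOT taken from the study):
# 42 of 60 public problems solved vs. 40 of 60 private problems solved.
z, p = two_proportion_z_test(42, 60, 40, 60)
print(f"z = {z:.3f}, p = {p:.3f}")
```

A large p-value here would mean the accuracy gap between the public and private sets is compatible with chance, which is the pattern one would expect if the model were not simply recalling public solutions.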

However, the researchers note that the tested o1-mini model has weaknesses in formulating detailed mathematical proofs. Instead, it often uses a trial-and-error approach and finds solutions through informal reasoning and heuristic "guesswork".

In "search"-type questions, where certain combinations of numbers or expressions have to be found, the model frequently fails to prove why no other solutions exist. Instead, it only verifies the solutions it has found. The study refers only to o1-mini, not to the recently released versions o1 and o1-pro.
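The difference between verifying found solutions and showing that no others exist can be sketched with a toy problem. The equation below (integer pairs with x² + y² = 25) is a hypothetical example chosen for illustration and does not come from the study. Checking a few guesses is easy, while completeness needs an argument that covers every candidate, here an exhaustive search over a range that provably contains all solutions.

```python
def is_solution(x, y):
    """Check a single candidate for the toy equation x^2 + y^2 = 25."""
    return x * x + y * y == 25


def all_solutions(bound):
    """Enumerate every integer pair with |x|, |y| <= bound that solves the equation."""
    return [
        (x, y)
        for x in range(-bound, bound + 1)
        for y in range(-bound, bound + 1)
        if is_solution(x, y)
    ]


# "Guess and verify": confirms the guesses are correct, but says nothing
# about whether other solutions were missed.
guesses = [(3, 4), (5, 0)]
print(all(is_solution(x, y) for x, y in guesses))

# Completeness: any solution must satisfy |x| <= 5 and |y| <= 5, so a search
# with bound 5 covers every possible candidate and finds the full set.
complete = all_solutions(5)
print(len(complete), sorted(set(complete) - set(guesses)))
```

In olympiad problems the search space is usually infinite or far too large to enumerate, so the completeness step has to be a written proof, which is the part the article says the model often leaves out.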


Shades of gray instead of black and white

The researchers consider it positive that o1-mini shows mathematical intuition similar to that of humans in many tasks. The model can typically recognize the correct solution path and identify important intermediate steps, even if the formal elaboration remains incomplete.

The study shows that although o1-mini is not a perfect mathematical problem solver, it does seem to have some reasoning abilities. According to the team, the consistent performance across the public and private datasets contradicts the assumption that the model works mainly through memorization.

Shortly after the release of o1-mini and o1-preview, a mathematician reported how the system was able to support him in his work. In October, a study by Princeton University and Yale University investigated which factors influence the performance of language models when solving tasks with chain-of-thought (CoT) prompts. CoT is a central component of o1 training and inference. According to that study, the models rely on probabilities and memorization, but also on a "probabilistic version of genuine reasoning".

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • A new study examines OpenAI's o1-mini model and finds no significant evidence that it relies mainly on memorization for its solutions. The researchers compared its performance on public Mathematical Olympiad tasks with its performance on private training tasks.
  • The model achieved similar results for both datasets: around 70 percent accuracy for search tasks and 21 percent for calculation tasks. However, the system shows weaknesses when formulating detailed mathematical proofs and uses a trial-and-error approach instead.
  • The researchers observed that o1-mini exhibits human-like mathematical intuition and can recognize important intermediate steps, even if the formal elaboration remains incomplete. The study refers only to o1-mini, not to the later released versions o1 and o1-pro.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.