A new study comparing public and private datasets shows how OpenAI's o1 model approaches mathematical problem-solving. The research aims to determine whether the AI uses actual logical reasoning or simply recalls memorized solutions.
According to the "OpenAI-o1 AB Testing" study, researchers found no significant evidence that the o1-mini model relies mainly on memorization when solving math problems. The team compared the model's performance on publicly available Mathematical Olympiad problems against similar but less well-known training problems used by the Chinese national team.
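The study's core method, comparing accuracy on a public and a private problem set and checking whether the gap is statistically significant, can be sketched with a two-proportion z-test. The counts below are invented for illustration, not figures from the paper:

```python
from math import sqrt, erf

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Two-proportion z-test: is the accuracy gap between two sets significant?"""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    # pooled success rate under the null hypothesis of equal accuracy
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical example: 28/40 correct on public problems vs 25/40 on private ones.
z, p = two_proportion_z(28, 40, 25, 40)
print(round(z, 2), round(p, 3))  # a small |z| and large p suggest no significant gap
```

A model that mainly memorized would be expected to score noticeably better on the widely circulated public set; comparable accuracy on both sets, as in this sketch, is the pattern the researchers report.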
However, the researchers note that the tested o1-mini model struggles to formulate rigorous, detailed mathematical proofs. Instead, it often takes a trial-and-error approach, finding solutions through informal reasoning and heuristic "guesswork".
In "search"-type questions, where specific combinations of numbers or expressions must be found, the model frequently fails to prove that no other solutions exist; it is limited to verifying the solutions it has found. The study covers only o1-mini, not the more recently released o1 and o1-pro.
Shades of gray instead of black and white
The researchers consider it positive that o1 shows a similar mathematical intuition to humans in many tasks. The model can typically recognize the correct solution path and identify important intermediate steps, even if the formal elaboration remains incomplete.
The study shows that although o1-mini is not a perfect mathematical problem solver, it does appear to have some genuine reasoning abilities. According to the team, the consistent performance across different datasets speaks against the assumption that the model works mainly through memorization.
Shortly after the release of o1-mini and o1-preview, a mathematician reported how the system was able to support his work. In October, a study by Princeton University and Yale University investigated which factors influence the performance of language models when solving tasks with chain-of-thought (CoT) prompts. CoT is a central component of o1's training and inference. According to that study, the models draw on probabilities and memorization, but also on a "probabilistic version of genuine reasoning".
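For readers unfamiliar with the technique, a chain-of-thought prompt simply instructs the model to write out intermediate reasoning steps before answering. A minimal sketch of such a prompt template follows; the wording and the sample question are illustrative, not taken from the Princeton/Yale study:

```python
def cot_prompt(question: str) -> str:
    """Wrap a question in a minimal chain-of-thought prompt template."""
    return (
        f"Q: {question}\n"
        "A: Let's think step by step."
    )

# Hypothetical usage: the resulting string would be sent to a language model,
# which then generates its intermediate reasoning before the final answer.
prompt = cot_prompt("If 3 apples cost 60 cents, how much do 7 apples cost?")
print(prompt)
```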