
OpenAI's new AI models o1-preview and o1-mini have achieved top scores in chatbot rankings, though the low number of ratings could skew the results.


According to the published leaderboard overview, o1-preview took first place in every evaluated category, including overall performance, safety, and technical capabilities. The STEM-focused o1-mini briefly shared second place overall with a GPT-4o version released in early September and leads in the technical categories.

The Chatbot Arena, a platform for comparing AI models, evaluated the new OpenAI systems using over 6,000 community ratings. The ratings show o1-preview and o1-mini excelling particularly at math tasks, complex prompts, and programming.

However, with just under 3,000 votes each, o1-preview and o1-mini have received far fewer ratings than established models like GPT-4o or Anthropic's Claude 3.5. This small sample size could distort the evaluation and limits the significance of the results.
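To see why the vote count matters, consider a minimal statistical sketch. This is not LMSYS's actual evaluation pipeline (Chatbot Arena fits an Elo-style rating model to pairwise votes); it simply bootstraps a win-rate estimate from simulated head-to-head votes. The 0.6 win rate and the 30,000-vote comparison figure are illustrative assumptions, chosen only to show how the 95 percent confidence interval narrows as votes accumulate.

import random
import statistics

# Hypothetical illustration, not LMSYS's pipeline: bootstrap a win-rate
# estimate from simulated pairwise votes and compare confidence intervals
# at different sample sizes.

def win_rate_with_ci(true_win_rate: float, n_votes: int, n_boot: int = 2000):
    """Simulate n_votes head-to-head outcomes, then bootstrap a 95% CI."""
    votes = [1 if random.random() < true_win_rate else 0 for _ in range(n_votes)]
    boot_means = sorted(
        statistics.mean(random.choices(votes, k=n_votes)) for _ in range(n_boot)
    )
    lo, hi = boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)]
    return statistics.mean(votes), (lo, hi)

random.seed(0)
# ~3,000 votes (roughly o1-preview's count) versus an assumed 30,000 for
# an established model; the true win rate of 0.6 is also an assumption.
for n in (3_000, 30_000):
    est, (lo, hi) = win_rate_with_ci(true_win_rate=0.6, n_votes=n)
    print(f"{n:>6} votes: estimate {est:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")

With ten times as many votes, the interval around the estimate shrinks by roughly a factor of three, which is why rankings based on a few thousand votes should be read more cautiously than those built on tens of thousands.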


OpenAI's o1 shows strong performance in math and coding

The o1 models aim to set a new standard for AI reasoning by "thinking" longer before answering. Even so, they don't outperform GPT-4o in every area: many tasks don't require complex logical reasoning, and a quick response from GPT-4o is often sufficient.


A chart from LMSYS on model strength in mathematics shows o1-preview and o1-mini both scoring above 1360, well clear of every other model.

Summary
  • OpenAI's new AI models, o1-preview and o1-mini, achieve top scores across categories in the Chatbot Arena. o1-preview ranks first in all evaluated areas, while o1-mini performs particularly well in technical tasks.
  • The models' performance was evaluated based on more than 6,000 community ratings. The strengths of o1-preview and o1-mini were particularly evident in math tasks, complex prompts, and programming.
  • However, the new models have received significantly fewer ratings than established systems such as GPT-4o or Claude 3.5. The small sample size may limit the validity of the results and introduce bias.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.