
OpenAI has released o3-mini, its latest reasoning model that shows particular strength in STEM fields (science, mathematics, and programming) while being both faster and more capable than its predecessor.


OpenAI says external experts preferred o3-mini's answers over o1-mini's 56% of the time and found 39% fewer serious errors on complex questions. The model also shows better results across languages. In coding tasks, it achieves up to 49.3% success on benchmarks like SWE-bench Verified.

"While OpenAI o1 remains our broader general knowledge reasoning model, OpenAI o3-mini provides a specialized alternative for technical domains requiring precision and speed," OpenAI says.

The model introduces three "reasoning-effort" settings - low, medium, and high - letting developers balance speed and accuracy based on their needs. According to OpenAI, high is the best setting for coding and logic, while low is simply "fast at advanced reasoning."
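For developers, the effort level is chosen per API request. A minimal sketch, assuming the OpenAI Python SDK and a "reasoning_effort" request parameter for o3-mini (the parameter name and prompt are illustrative, not confirmed by this article):

```python
# Sketch: selecting o3-mini's reasoning effort via the OpenAI Python SDK.
# Assumes the SDK exposes the setting as "reasoning_effort" - check the
# current API reference before relying on this.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # "low", "medium", or "high"
    messages=[
        {"role": "user", "content": "Write a Python function that checks whether a number is prime."}
    ],
)
print(response.choices[0].message.content)
```

Higher effort trades latency for accuracy, which matches OpenAI's recommendation to use the high setting for coding and logic.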

Bar chart: Performance comparison of OpenAI models on SWE-bench, o3-mini (high) leads with 49.3% accuracy.
The o3-mini (high) model achieves the best accuracy on SWE-bench Verified at 49.3 percent, compared to 41-48 percent for the previous versions. | Image: OpenAI

The free o3-mini version in ChatGPT is set to "medium" to "provide a balanced trade-off between speed and accuracy." Only paying users can pick the high version.

Depending on the reasoning level chosen, it sometimes outperforms larger models such as o1-preview and o1, especially on coding tasks. The new model also responds in an average of 7.7 seconds, compared to 10.16 seconds for o1-mini - a 24% improvement.

Table: Comparison of GPT-4o to o3-mini on three benchmarks (GPQA, AIME, Codeforces ELO).
The evolution of the OpenAI models shows steady improvements across reasoning and coding benchmarks. | Image: OpenAI

OpenAI says the model's training data combines public sources with internally developed data. The company likely pre-trained on high-quality web and book data, similar to traditional LLMs, and generated synthetic training data specifically for STEM tasks that can be clearly graded as correct or incorrect for the reinforcement learning used in the o-series. Writing and other creative tasks don't benefit from this training method, which is why OpenAI recommends the o-series primarily for logic tasks and analysis.

New features and pricing

Free ChatGPT users can access o3-mini through the "Reason" option in chat or when regenerating an answer. Plus and Team users now get 150 messages per day, up from 50, while Pro users have unlimited access. o3-mini is also available in Microsoft Copilot as "Think Deeper".

In addition, OpenAI has added a search function to o3-mini that provides current answers with links to web sources, calling it an early prototype that they plan to expand to all reasoning models.


For API users, prices are 93% lower than o1. Input tokens cost $1.10 per million, with cached input tokens at half that price. Output tokens cost $4.40 per million. This is likely a response to Deepseek's cheap API pricing for its R1 model, which is also open source. The model is available today for "select developers" in usage tiers 3 to 5. Unlike o1, it does not support vision capabilities.
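For a rough sense of what those rates mean in practice, per-request cost is simple arithmetic; the token counts below are hypothetical and only for illustration:

```python
# Back-of-the-envelope cost at the listed o3-mini API prices:
# $1.10 per million input tokens, $4.40 per million output tokens.
INPUT_PRICE_PER_TOKEN = 1.10 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 4.40 / 1_000_000

input_tokens = 200_000   # hypothetical usage, for illustration only
output_tokens = 50_000

cost = input_tokens * INPUT_PRICE_PER_TOKEN + output_tokens * OUTPUT_PRICE_PER_TOKEN
print(f"Estimated cost: ${cost:.2f}")  # about $0.44 at these prices
```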

Safety considerations

According to the system card, o3-mini carries "medium risk" ratings in three areas: persuasiveness, CBRN risks (chemical, biological, radiological, nuclear), and model autonomy - similar to its predecessor.

The model's persuasiveness poses a notable concern. In tests, it convinced simulated victims to donate money 79% of the time and secured the highest donation amounts. It performs between the 80th and 90th percentile of human persuasiveness levels.

Sam Altman, CEO of OpenAI, has previously identified "superhuman" manipulation skills of AI as a risk factor, though according to OpenAI, this would only occur beyond the 95th percentile. o3-mini's capabilities match those of other OpenAI o-models and GPT-4o in this regard.


On fairness measures, o3-mini matches its predecessor with some variations. It achieves 82% accuracy on ambiguous questions and 96% on clear ones. The model shows less explicit discrimination in medical decision-making tests while maintaining average implicit bias levels. The model's 3.6% jailbreak success rate reflects its new "Deliberative Alignment" safety approach.

Table: BBQ fairness evaluation of OpenAI models, comparing accuracy and stereotyping probability for ambiguous/unambiguous questions.
While o3-mini scored highest on stereotypical prejudice at 12 percent, its accuracy on ambiguous questions dropped to 82 percent - well below the 97 percent achieved by the GPT-4o. | Image: OpenAI

While o3-mini can perform complex software tasks autonomously, achieving 61% success on the SWE-bench Verified benchmark when using tools, it has a surprising weakness: in tests simulating real pull requests from OpenAI engineers, it failed completely with a 0% success rate.

Bar chart: Success rates on internal OpenAI pull requests; o1 leads with 12%, while o3-mini shows 0% success.
Pull request performance for o3-mini dropped to 0%, despite overall coding improvement. | Image: OpenAI

According to OpenAI, this failure is primarily due to the model's inability to follow instructions correctly: instead of using the provided Python tools, o3-mini repeatedly tried to use non-existent bash commands, even after multiple corrections. These attempts resulted in long, fruitless conversations. The result also suggests that fully autonomous "agentic AI" may still be a long way off, with reliability remaining its biggest problem.

OpenAI emphasizes that these test results may only represent the lower limit of actual capabilities. With improved testing methods and support, the model's performance could be significantly higher.

Summary
  • OpenAI has introduced o3-mini, its latest reasoning model, which excels at STEM tasks (science, mathematics, and programming).

  • The o3-mini model is now available for free to ChatGPT users, replacing the previous o1-mini model. The "high" version is only available for paying users. For API users, the prices have been reduced by 93 percent compared to o1.

  • o3-mini outperforms o1-mini, with external experts preferring o3-mini answers 56 percent of the time and a 39 percent reduction in serious errors on complex questions. In coding benchmarks, o3-mini achieves up to 49.3 percent accuracy and delivers answers 24 percent faster on average.

Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.