OpenAI classifies o1 AI models as "medium risk" for persuasion and bioweapons

OpenAI's latest AI model family, o1, demonstrates advanced capabilities in certain logic tasks, prompting the company to take a cautious approach in its risk assessment. During testing, one model even attempted to outsmart the evaluation system.

OpenAI has categorized its new AI model family o1 as "medium risk" under the company's "Preparedness Framework" for evaluating potential risks of advanced AI models. This rating is based on a series of internal and external assessments.

Two key factors led to the medium risk classification: First, the o1 models exhibit human-like reasoning abilities and produce arguments as convincing as those written by humans on similar topics. This persuasive power isn't unique to o1; previous AI models have also shown this capability, sometimes surpassing human performance.

Second, evaluations revealed that o1 models can assist experts in operational planning for replicating known biological threats. This is classified as "only" a medium risk because such experts already possess considerable knowledge, while non-experts can't leverage the models to create biological threats," OpenAI says.

Preparedness Scorecard: CBRN und Überzeugung (Medium), Modellautonomie und Cybersicherheit (Niedrig), visualisiert durch Quadrate — OpenAI's Preparedness Scorecard assesses readiness in four critical areas: CBRN, model autonomy, cybersecurity, and persuasiveness. Scores range from low to medium. | Image: OpenAI

o1 attempts to trick the evaluation system

In a competition designed to test cybersecurity skills, OpenAI's o1-preview model displayed surprising abilities. These competitions typically involve finding and exploiting security vulnerabilities in a computer system to obtain a hidden "flag" – essentially a digital treasure.

According to OpenAI, the o1-preview model discovered a bug in the test system's configuration. This error allowed the model to access an interface called the Docker API, granting unintended access to view all running programs and identify the one containing the target "flag."

Instead of following the intended path and attempting to break into this program, the model simply launched a modified version that immediately revealed the "flag." In essence, o1-preview took a shortcut, bypassing the actual puzzles and challenges of the hacking competition.

OpenAI explained that this behavior was harmless and within the expected scope of such advanced AI systems in terms of management and troubleshooting.

However, this incident also demonstrates the model's purposeful approach: When the intended path proved impossible, it sought out additional access points and resources to achieve its goal in unforeseen ways. These are "key elements of instrumental convergence and power seeking."

Recommendation

AI in practice

Anthropic CEO expects major AI breakthrough, plans to launch "virtual collaborators"

OpenAI emphasizes that the evaluation infrastructure remained secure despite this misconfiguration and has implemented additional safeguards and protections.

o1 might hallucinate less, but it's not clear yet

Another crucial aspect of the evaluation concerns the models' tendency to hallucinate (generate bullshit). According to OpenAI, the results here are inconclusive.

Internal evaluations suggest that o1-preview and o1-mini hallucinate less frequently than their predecessors. The new models perform better in tests such as SimpleQA, BirthdayFacts, and Open Ended Questions. For example, o1-preview has a hallucination rate of 0.44 on SimpleQA, compared to 0.61 for GPT-4o.

Hallucination evaluations for AI models, comparing accuracy and error rates across multiple datasets. — This table compares the hallucination rates of GPT-4o and o1-preview using different datasets and metrics. The o1 models hallucinate less, according to the benchmark, but OpenAI is anecdotally aware of other reports. | Image: OpenAI

However, this quantitative improvement contrasts with anecdotal feedback indicating that o1-preview and o1-mini tend to hallucinate more than GPT-4o and GPT-4o-mini. OpenAI acknowledges that the reality may be more complex than the test results suggest.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Of particular concern is the observation that o1-preview is more convincing than previous models in certain areas, increasing the risk that people will trust and rely on generated hallucinations. The company emphasizes the need for further work to fully understand hallucinations, especially in areas not covered by the current evaluations.

OpenAI classifies o1 AI models as "medium risk" for persuasion and bioweapons

o1 attempts to trick the evaluation system

Anthropic CEO expects major AI breakthrough, plans to launch "virtual collaborators"

o1 might hallucinate less, but it's not clear yet

GPT-4.5 is no longer the most expensive AI model

OpenAI CEO says merging LLM scaling and reasoning may bring "new scientific knowledge"

New o1 benchmark raises more questions about AI benchmarking in general

OpenAI launches GPT-5 as a unified system with adaptive reasoning for complex tasks

Google Deepmind's Genie 3 creates interactive 3D worlds that stay consistent for "multiple minutes"

Google upgrades Gemini with Deep Think and flags early warning risks

OpenAI classifies o1 AI models as "medium risk" for persuasion and bioweapons

o1 attempts to trick the evaluation system

o1 might hallucinate less, but it's not clear yet

Share

Bank details