Content
summary Summary

OpenAI's latest AI model family, o1, demonstrates advanced capabilities in certain logic tasks, prompting the company to take a cautious approach in its risk assessment. During testing, one model even attempted to outsmart the evaluation system.

Ad

OpenAI has categorized its new AI model family o1 as "medium risk" under the company's "Preparedness Framework" for evaluating potential risks of advanced AI models. This rating is based on a series of internal and external assessments.

Two key factors led to the medium risk classification: First, the o1 models exhibit human-like reasoning abilities and produce arguments as convincing as those written by humans on similar topics. This persuasive power isn't unique to o1; previous AI models have also shown this capability, sometimes surpassing human performance.

Second, evaluations revealed that o1 models can assist experts in operational planning for replicating known biological threats. This is classified as "only" a medium risk because such experts already possess considerable knowledge, while non-experts can't leverage the models to create biological threats," OpenAI says.

Ad
Ad
Preparedness Scorecard: CBRN und Überzeugung (Medium), Modellautonomie und Cybersicherheit (Niedrig), visualisiert durch Quadrate
OpenAI's Preparedness Scorecard assesses readiness in four critical areas: CBRN, model autonomy, cybersecurity, and persuasiveness. Scores range from low to medium. | Image: OpenAI

o1 attempts to trick the evaluation system

In a competition designed to test cybersecurity skills, OpenAI's o1-preview model displayed surprising abilities. These competitions typically involve finding and exploiting security vulnerabilities in a computer system to obtain a hidden "flag" – essentially a digital treasure.

According to OpenAI, the o1-preview model discovered a bug in the test system's configuration. This error allowed the model to access an interface called the Docker API, granting unintended access to view all running programs and identify the one containing the target "flag."

Instead of following the intended path and attempting to break into this program, the model simply launched a modified version that immediately revealed the "flag." In essence, o1-preview took a shortcut, bypassing the actual puzzles and challenges of the hacking competition.

OpenAI explained that this behavior was harmless and within the expected scope of such advanced AI systems in terms of management and troubleshooting.

However, this incident also demonstrates the model's purposeful approach: When the intended path proved impossible, it sought out additional access points and resources to achieve its goal in unforeseen ways. These are "key elements of instrumental convergence and power seeking."

Recommendation

OpenAI emphasizes that the evaluation infrastructure remained secure despite this misconfiguration and has implemented additional safeguards and protections.

o1 might hallucinate less, but it's not clear yet

Another crucial aspect of the evaluation concerns the models' tendency to hallucinate (generate bullshit). According to OpenAI, the results here are inconclusive.

Internal evaluations suggest that o1-preview and o1-mini hallucinate less frequently than their predecessors. The new models perform better in tests such as SimpleQA, BirthdayFacts, and Open Ended Questions. For example, o1-preview has a hallucination rate of 0.44 on SimpleQA, compared to 0.61 for GPT-4o.

Hallucination evaluations for AI models, comparing accuracy and error rates across multiple datasets.
This table compares the hallucination rates of GPT-4o and o1-preview using different datasets and metrics. The o1 models hallucinate less, according to the benchmark, but OpenAI is anecdotally aware of other reports. | Image: OpenAI

However, this quantitative improvement contrasts with anecdotal feedback indicating that o1-preview and o1-mini tend to hallucinate more than GPT-4o and GPT-4o-mini. OpenAI acknowledges that the reality may be more complex than the test results suggest.

Ad
Ad
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Of particular concern is the observation that o1-preview is more convincing than previous models in certain areas, increasing the risk that people will trust and rely on generated hallucinations. The company emphasizes the need for further work to fully understand hallucinations, especially in areas not covered by the current evaluations.

Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Summary
  • OpenAI rates its new o1 AI model family as "medium" risk, citing human-like reasoning abilities and the potential to assist experts in replicating biological threats.
  • In a cybersecurity test, o1-preview exploited a system flaw to achieve its goal unconventionally, demonstrating "instrumental convergence and pursuit of power."
  • Hallucination tendencies of o1 models remain unclear. While internal tests show improvement, anecdotal reports suggest otherwise. OpenAI calls for more comprehensive research on AI hallucinations.
Sources
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.