AI models can barely control their own reasoning, and OpenAI says that's a good sign

With GPT-5.4 Thinking, OpenAI is reporting on “CoT controllability” for the first time – a measure of whether AI models can deliberately manipulate their own reasoning. An accompanying study finds that reasoning models almost universally fail at this task, which OpenAI says is encouraging for AI safety.

Moltbook's alleged AI civilization is just a massive void of bloated bot traffic

Over 2.6 million AI agents interact on Moltbook with zero human involvement. They post, comment, and vote, but a new study finds they never actually learn from each other: hollow interaction without mutual influence, shared memory, or social structures.

Current language model training leaves large parts of the internet on the table

Large language models learn from web data, but which pages actually make it into training sets depends heavily on a seemingly mundane choice: the HTML extractor. Researchers at Apple, Stanford, and the University of Washington found that three common extraction tools pull surprisingly different content from the same web pages.
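The blurb doesn't name the three tools, but the effect is easy to demonstrate. Below is a minimal sketch using two widely used Python extractors, trafilatura and BeautifulSoup's plain get_text, chosen here purely for illustration rather than because the study used them:

```python
# Minimal sketch: two common extractors, very different "content" from the
# same page. Requires: pip install trafilatura beautifulsoup4
import trafilatura
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav>Home | About | Subscribe</nav>
  <article>
    <h1>Sample headline</h1>
    <p>The paragraph a model trainer actually wants, surrounded by the
    navigation links, cookie notices, and footer text it does not.</p>
  </article>
  <footer>(c) 2025 Example Corp. Cookie notice. Legal. More links.</footer>
</body></html>
"""

# trafilatura applies boilerplate-removal heuristics to isolate the article
# body (it can return None on very short pages like this toy example).
main_text = trafilatura.extract(html)

# A naive get_text() keeps everything, boilerplate included.
raw_text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

print("trafilatura:", main_text)
print("get_text() :", raw_text)
```

Whichever of those two strings lands in the training set decides whether a model learns article prose or navigation chrome.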

Anthropic can't stop humanizing its AI models: now Claude Opus 3 gets a retirement blog

Anthropic is retiring its Claude Opus 3 AI model and letting it publish weekly essays on Substack. The company says it conducted “retirement interviews” to ask the model about its wishes, and it “enthusiastically” agreed. The move is a prime example of how AI companies keep pushing the humanization of their products, blurring the line between philosophical caution and PR stagecraft.

DeepMind suggests AI should occasionally assign humans busywork so we do not forget how to do our jobs

AI systems should sometimes hand humans tasks the AI could easily handle itself, just so people don't forget how to do their jobs. That's one of the more striking recommendations from a new Google DeepMind paper on how AI agents should delegate work.

OpenAI wants to retire the AI coding benchmark that everyone has been competing on

OpenAI says the SWE-bench Verified programming benchmark has lost its value as a meaningful measure of AI coding ability. The company points to two main problems. First, at least 59.4 percent of the benchmark's tasks are flawed: their tests reject correct solutions because they enforce specific implementation details or check functions not described in the task.
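The article doesn't quote a failing test, but the failure mode is easy to illustrate. Here's a hypothetical sketch, not an actual SWE-bench task, contrasting a behavioral test that any correct fix passes with an over-specified test that pins an implementation detail:

```python
# Hypothetical illustration (not from SWE-bench): how a test can reject a
# correct solution by checking *how* the fix is written instead of *what*
# it does.
import inspect

def dedupe(items):
    # A correct patch: removes duplicates while preserving input order.
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]

def test_behavior():
    # Behavioral test: any correct fix passes this.
    assert dedupe([3, 1, 3, 2, 1]) == [3, 1, 2]

def test_implementation_detail():
    # Over-specified test: insists the fix use OrderedDict, a detail the
    # task statement never required, so the correct patch above fails.
    assert "OrderedDict" in inspect.getsource(dedupe)

test_behavior()  # passes
try:
    test_implementation_detail()
except AssertionError:
    print("correct solution rejected over an implementation detail")
```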

Second, many tasks and solutions have leaked into leading models' training data. OpenAI reports that GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash Preview could reproduce some original fixes from memory, meaning benchmark progress increasingly reflects what a model has seen, not how well it codes. OpenAI recommends SWE-bench Pro instead and is building its own non-public tests.
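The report doesn't say how the memorization was detected; a common style of check looks for long verbatim token overlaps between a model's output and the reference fix. A minimal sketch, assuming both are available as plain strings (the window size is a heuristic borrowed from training-data deduplication work, not a fixed standard):

```python
# Minimal sketch of a verbatim-overlap contamination check. If a model's
# generated patch shares long exact token n-grams with the reference fix,
# the fix was likely memorized rather than re-derived.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def memorization_score(reference: str, generation: str, n: int = 13) -> float:
    """Fraction of the reference's n-grams that reappear verbatim."""
    ref = ngrams(reference.split(), n)
    gen = ngrams(generation.split(), n)
    return len(ref & gen) / max(len(ref), 1)

# Toy strings (hypothetical, not real SWE-bench data); n is lowered because
# the example is short.
reference_patch = "if value is None : return default return transform ( value )"
model_output    = "if value is None : return default return transform ( value )"
print(memorization_score(reference_patch, model_output, n=5))  # 1.0 -> suspicious
```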

There's a possible strategic angle here: a "contaminated" benchmark can make rivals, especially open-weight models, look better than they are and skew rankings. SWE-bench Verified was long the gold standard for AI coding evaluation, with OpenAI, Anthropic, Google, and many Chinese open-weight models competing for small leads. AI benchmarks can provide useful signal, but how well they track real-world coding ability remains limited.