These games provide objective ways to measure skills like planning and decision-making under uncertainty. Gemini 3 Pro and Gemini 3 Flash currently hold the top spots in all rankings. The Werewolf benchmark also doubles as a security research tool: it tests whether models can detect manipulation without any real-world consequences. According to Google Deepmind CEO Demis Hassabis, the AI industry needs more rigorous tests to properly evaluate the latest models.
Mozilla is rolling out new AI settings with Firefox 148 on February 24. Users will be able to manage all the browser's generative AI features from a single location, or turn them off entirely, the company announced in a blog post.
The new settings cover translations, automatic image descriptions in PDFs, AI-powered tab grouping, link previews, and a chatbot in the sidebar. The chatbot supports services like Anthropic Claude, ChatGPT, Microsoft Copilot, Google Gemini, and Le Chat Mistral.
For users who want nothing to do with AI, a single toggle disables all AI features. Once enabled, no pop-ups or notifications about current or future AI features will appear, and the settings persist through updates. Users who want to try the feature early can find it in Firefox Nightly.
OpenAI has released the Codex app for macOS, letting developers control multiple AI agents simultaneously and run tasks in parallel. According to OpenAI, it's easier to use than a terminal, making it accessible to more developers. Users can manage agents asynchronously across projects, automate recurring tasks, and connect agents to external tools via "skills." They can also review and correct work without losing context.
The Codex Mac app is available for ChatGPT Plus, Pro, Business, Enterprise, and Edu accounts, and OpenAI is doubling usage limits for paid plans. The app integrates with the CLI, IDE extension, and cloud through a single account. Free and Go users can try it for a limited time. The push is likely a response to Claude Code's success with knowledge workers and growing demand for agentic systems (see Claude Cowork) that handle more complex tasks than standard chatbots.
Jerry Tworek, one of the minds behind OpenAI's reasoning models, sees a fundamental problem with current AI: it can't learn from mistakes. "If they fail, you get kind of hopeless pretty quickly," Tworek says on the Unsupervised Learning podcast. "There isn't a very good mechanism for a model to update its beliefs and its internal knowledge based on failure."
The researcher, who worked on OpenAI's reasoning models like o1 and o3, recently left OpenAI to tackle this problem. "Unless we get models that can work themselves through difficulties and get unstuck on solving a problem, I don't think I would call it AGI," he explains, describing AI training as a "fundamentally fragile process." Human learning, by contrast, is robust and self-stabilizing. "Intelligence always finds a way," Tworek says.
OpenClaw (formerly Clawdbot) and Moltbook let attackers walk through the front door
How secure are AI agents? Not very, it turns out. OpenClaw’s system prompts can be extracted with a single attempt. Moltbook’s database was publicly accessible—including API keys that could let anyone impersonate users like Andrej Karpathy.
Google Deepmind pioneer David Silver departs to found AI startup, betting LLMs alone won't reach superintelligence
David Silver, one of the key AI researchers behind landmark Deepmind projects like AlphaGo and AlphaZero, is leaving the Google subsidiary to found his own startup. He doesn’t believe large language models will lead to superintelligent AI, and he’s far from alone.
OpenAI still leads enterprise AI, but Anthropic is gaining fast, according to a new study
An oligopoly is taking shape in enterprise AI: OpenAI still leads, but Anthropic is catching up fast while Microsoft dominates applications. And the open-source revolution? For large companies, it’s not happening yet. If anything, they’re moving the other way.