Anthropic recruits ex-Google data center veterans to build its own AI infrastructure empire
Anthropic is discussing building at least 10 gigawatts of data center capacity worth hundreds of billions of dollars, recruiting ex-Google managers and lining up Google as a financial backer to make it happen.
Google DeepMind has upgraded its specialized thinking mode "Gemini 3 Deep Think" and made it available through the Gemini app and as an API via a Vertex AI early access program. The upgrade targets complex tasks in science, research, and engineering.
Google AI Ultra subscribers can access Deep Think through the Gemini app, while developers and researchers can sign up separately for the API program. According to Google DeepMind, the model tops several major benchmarks:
| Benchmark | Deep Think | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro Preview |
| --- | --- | --- | --- | --- |
| ARC-AGI-2 (logical reasoning) | 84.6% | 68.8% | 52.9% | 31.1% |
| Humanity's Last Exam (academic reasoning) | 48.4% | 40.0% | 34.5% | 37.5% |
| MMMU-Pro (multimodal reasoning) | 81.5% | 73.9% | 79.5% | 81.0% |
| Codeforces (coding/algorithms, Elo) | 3,455 | 2,352 | - | 2,512 |
While Deep Think dominates in logic and coding, the gap narrows significantly on MMMU-Pro: it scored 81.5 percent, barely ahead of Gemini 3 Pro Preview at 81.0 percent. This suggests the thinking upgrades focus heavily on abstract reasoning rather than visual processing. Deep Think also achieved gold medal-level results at the 2025 Physics and Chemistry Olympiads. Google has also published examples of the model in scientific use.
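To put the Codeforces numbers in perspective: Elo-style ratings translate into head-to-head expectations via the standard Elo expected-score formula. The sketch below uses that generic formula (not Codeforces' exact rating machinery) with the ratings reported above:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Deep Think (3,455) vs. Gemini 3 Pro Preview (2,512): a ~940-point gap
# implies Deep Think would be expected to win almost every matchup.
print(elo_expected(3455, 2512))
```

Under this formula, a 400-point gap already corresponds to roughly 10:1 odds, so a gap of nearly 1,000 points implies an expected score above 99 percent.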
Isomorphic Labs, Google DeepMind's AI medicine startup, has unveiled a new system called "Isomorphic Labs Drug Design Engine" (IsoDDE) that it says outperforms AlphaFold 3. According to the company, IsoDDE doubles AlphaFold 3's accuracy when predicting protein-ligand structures that differ significantly from the training data (see left graph below).
IsoDDE outperforms previous methods in structure prediction, binding pocket recognition, and binding strength prediction, according to Isomorphic Labs. | Image: Isomorphic Labs
Beyond structure prediction, IsoDDE can identify previously unknown docking sites on proteins in seconds based solely on their blueprint, with accuracy that Isomorphic Labs says approaches that of lab experiments. Isomorphic Labs also claims the system can estimate how strongly a drug binds to its target at a fraction of the time and cost of traditional methods. These capabilities could uncover new starting points for active compounds and speed up computational screening.
Best multimodal models still can't crack 50 percent on basic visual entity recognition
A new benchmark called WorldVQA tests whether multimodal AI models actually recognize what they see or just make it up. Even the best performer, Gemini 3 Pro, tops out at 47.4 percent when asked for specific details like exact species or product names instead of generic labels. Worse, the models are convinced they’re right even when they’re wrong.
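The core idea of scoring specific entity names rather than generic labels can be sketched as a simple exact-match loop. The normalization and data below are illustrative assumptions, not WorldVQA's actual implementation:

```python
def normalize(text: str) -> str:
    """Lowercase and drop punctuation so only the entity name is compared."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch == " ").strip()

def score(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference entity name."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Illustrative: a generic label ("bird") earns no credit; only the
# exact species or product name counts as a hit.
preds = ["Atlantic puffin", "bird", "Boeing 747"]
refs = ["Atlantic puffin", "Atlantic puffin", "Boeing 747-8"]
print(score(preds, refs))  # 1 of 3 exact matches
```

Note that under exact-match scoring, a confidently wrong specific answer and a vague generic one fail equally, which is why the overconfidence finding matters.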
Study finds AI reasoning models generate a "society of thought" with arguing voices inside their process
New research reveals that reasoning models like Deepseek-R1 simulate entire teams of experts when solving problems: some extraverted, some neurotic, all conscientious. This internal debate doesn’t just look like teamwork. It measurably boosts performance.
Google's PaperBanana uses five AI agents to auto-generate scientific diagrams
Researchers at Peking University and Google built a system that turns method descriptions into scientific diagrams automatically. Five specialized AI agents handle everything from finding reference images to quality control, tackling one of the last manual bottlenecks in academic publishing.
Anthropic's security training fails when Claude operates a graphical user interface
In pilot tests, Anthropic was able to get Claude Opus 4.6 to provide detailed instructions for making mustard gas inside an Excel spreadsheet and to maintain an accounting spreadsheet for a criminal gang - behaviors that occurred rarely or not at all in text-only interactions.
"We found some kinds of misuse behavior in these pilot evaluations that were absent or much rarer in text-only interactions," Anthropic writes in the Claude Opus 4.6 system card. "These findings suggest that our standard alignment training measures are likely less effective in GUI settings."
According to Anthropic, tests with the predecessor model Claude Opus 4.5 in the same environment showed "similar results" - meaning the problem persists across model generations and had previously gone unnoticed. The vulnerability apparently arises because models learn to reject malicious requests in conversation but do not fully transfer this behavior to agent-based tool use.