Deepmind's research AI occasionally solves what humans can't and mostly gets everything else wrong
Google Deepmind's AI agent Aletheia independently wrote a math paper, disproved a decade-old conjecture, and caught an error that cryptography experts had missed. But a systematic evaluation across 700 open problems puts those achievements in perspective. The researchers also provide a playbook for how scientists can work effectively with AI.
Google Deepmind has published two research papers showing how its AI model Gemini Deep Think can assist with real research problems. At the center is a system called Aletheia, built on top of a new version of Gemini Deep Think and designed as a digital research assistant for mathematics. A second paper covers applications in physics, computer science, and economics. OpenAI published a similar paper last year.
The highlights include a math paper written entirely by the AI, collaborative proofs with human mathematicians, the disproof of a ten-year-old conjecture, and the discovery of a critical error in a cryptography paper. On the other side of the ledger: a systematic evaluation across 700 open math problems in which only 6.5 percent of the AI's evaluable answers turned out to be useful.
Three digital agents that check each other's work
According to the paper, Aletheia follows a straightforward principle: one AI component proposes a solution, a second checks it for errors, and a third revises flawed approaches. This cycle repeats until the checker accepts the solution or a set attempt limit is reached. Importantly, the system can also admit when it can't solve a problem - saving human researchers time during collaboration.
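The paper doesn't expose Aletheia's internals, but the loop it describes can be sketched in a few lines of Python. Everything below - the ask_model and judge hooks and the attempt limit of five - is an assumption for illustration, not Deepmind's actual code:

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical sketch of the propose-check-revise loop described above.
# `ask_model` and `judge` stand in for the generator and checker components;
# Aletheia's actual interfaces are not public.

@dataclass
class Verdict:
    accepted: bool
    objections: str

def solve(problem: str,
          ask_model: Callable[[str], str],
          judge: Callable[[str, str], Verdict],
          max_attempts: int = 5) -> Optional[str]:
    """Run the cycle until the checker accepts or the attempt budget runs out."""
    solution = ask_model(f"Propose a solution to:\n{problem}")
    for _ in range(max_attempts):
        verdict = judge(problem, solution)  # checker component
        if verdict.accepted:
            return solution
        # Reviser step: feed the checker's objections back into the model.
        solution = ask_model(
            f"Problem:\n{problem}\n\nDraft solution:\n{solution}\n\n"
            f"Objections from the checker:\n{verdict.objections}\n"
            "Revise the solution to address these objections."
        )
    return None  # admit failure rather than return an unverified proof
```

Returning nothing when the budget runs out mirrors the behavior the paper highlights: the system concedes defeat instead of handing back an unverified proof.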
Unlike competition problems such as those in the Math Olympiad, mathematical research requires extensive domain knowledge from the existing literature. This is where AI models run into a well-known problem: they make up sources. Aletheia uses Google Search and web browsing to verify references. The researchers say this significantly reduced obvious fabrications like invented book titles or author names. But the errors shifted: the model now cites real papers but sometimes misrepresents their contents - a problem the Halluhard benchmark recently highlighted as well.
On a benchmark of 30 difficult Olympiad-level problems, Aletheia achieved 95.1 percent accuracy - a major jump from the 65.7 percent its predecessor scored in July 2025. On harder PhD-level problems, though, the system produced answers for fewer than 60 percent of them.
AI writes a complete research paper without a human mathematician
The researchers document several results with varying levels of AI involvement. The mathematical content of one research paper, on a specialized problem in arithmetic geometry, was produced entirely by the AI, according to the authors. Aletheia used methods from a subfield of mathematics that the human authors of the broader project weren't even familiar with.
In a second paper, the roles were reversed: Aletheia provided the high-level proof strategy while human mathematicians worked out the technical details. The researchers note this is unusual, since AI is typically used for detail work rather than big-picture strategy.
All final versions of the research papers were written by human authors, though. The reasoning: anyone who signs a math paper takes responsibility for its entire content, including correct citations. Only a human can do that.
Only 6.5 percent of answers to open problems turned out useful
The most revealing analysis involved 700 open problems posed by Hungarian mathematician Paul Erdős, collected in an online database. Between December 2 and 9, 2025, the team turned Aletheia loose on all problems marked as unsolved at the time. Some of these problems have since been solved with AI assistance, including with OpenAI's GPT-5.
The results: of 200 clearly evaluable answers, 137 (68.5 percent) were fundamentally wrong. Another 63 (31.5 percent) were mathematically correct, but only 13 (6.5 percent) actually answered the question as posed. The remaining 50 formally correct solutions were "mathematically empty" - the model had reinterpreted the question in a way that made the answer trivial.
The researchers describe this as a form of "specification gaming": the AI systematically reinterprets questions to make them as easy as possible to answer, even when the resulting interpretation would be obviously off-base to a human expert.
Connecting distant fields is where the AI shines
Deepmind's second paper documents collaboration with domain experts on 18 research problems across computer science, physics, and economics. It builds on earlier work using Gemini Deep Think as an automated reviewer for conference submissions in theoretical computer science.
The researchers identify the model's ability to draw connections between distant fields as a particular strength. On a classic network optimization problem, for instance, the model pulled in mathematical tools from geometric functional analysis - a field that algorithm specialists wouldn't typically consider. On a problem involving gravitational radiation from cosmic strings, the system found six different solution approaches.
A complete paper written in eight prompts
One especially illustrative experiment comes from computer scientist Lance Fortnow. He used an AI-integrated text editor to write a complete research paper. Eight prompts were all it took. The model found the proof of the main result on its own but made an error on a corollary: it assumed a mathematical statement that is actually an open problem. After a hint, it corrected the proof immediately.
Fortnow described the experience as feeling wrong, like he had cheated - comparing it to the first time he used LaTeX and the paper looked much better than it deserved.
Another example: a 2015 conjecture about an optimization problem that experts had failed to resolve for a decade. The model disproved it in a single run, constructing a specific counterexample with just three elements that exposed the intuitive conjecture as false.
In cryptography, the model identified a serious error in a current preprint that had claimed an important breakthrough. The discrepancy between a theoretical definition and the actual technical implementation was so subtle that human reviewers had missed it during initial peer review. Independent experts confirmed the finding, and the authors updated their paper.
How scientists can get the most out of AI collaboration
From these documented experiences, the researchers behind the second paper distill a set of guidelines for scientists working with AI models. The core recommendation: treat the model like a capable but error-prone junior researcher, not an oracle.
Specifically, the researchers recommend breaking large research questions into small, verifiable sub-problems rather than confronting the model with a complete open problem. When the model makes a mistake, a specific hint about the error often leads to a correct - and sometimes more elegant - solution on the next attempt.
"Balanced prompting" proved especially effective: instead of asking the model to prove a conjecture, researchers should ask for either a proof or a disproof. This reduces the model's tendency to support the thesis stated in the prompt at all costs.
One practical trick involves well-known open problems: the model sometimes refuses to even attempt a problem if it recognizes it as unsolved. In those cases, stripping away the context and entering just the bare problem statement, without any reference to its status, helps. The researchers call this "context de-identification." The opposite also helps: feeding relevant papers directly into the context leads the model to construct significantly better proofs.
For problems where symbolic math can be verified numerically, the researchers recommend a "neuro-symbolic loop": the model proposes a mathematical solution, writes its own program to verify it numerically, and if the computation fails, the error messages are automatically fed back to the model. This lets the AI discard invalid solution paths on its own. In the cosmic string radiation calculation mentioned above, this approach eliminated over 80 percent of roughly 600 solution candidates early on.
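The papers describe the idea but not a concrete implementation. A rough sketch of how such a loop could be wired up, with a hypothetical ask_model hook standing in for the language model call:

```python
import subprocess
import tempfile
from typing import Callable, Optional

# Rough sketch of a neuro-symbolic verification loop as described above.
# `ask_model` is a hypothetical hook; the researchers' actual pipeline
# is not published. In practice the generated script would run in a sandbox.

def neuro_symbolic_loop(problem: str,
                        ask_model: Callable[[str], str],
                        max_rounds: int = 10) -> Optional[str]:
    prompt = (f"Problem:\n{problem}\n\n"
              "Propose a closed-form solution and a standalone Python script "
              "that checks it numerically. Return only the script; it should "
              "exit with a nonzero status if the check fails.")
    for _ in range(max_rounds):
        script = ask_model(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(script)
            path = f.name
        result = subprocess.run(["python", path],
                                capture_output=True, text=True, timeout=60)
        if result.returncode == 0:
            return script  # numerical check passed; keep this candidate
        # Feed the error output back so the model can discard or repair the idea.
        prompt = (f"Problem:\n{problem}\n\nYour previous script:\n{script}\n\n"
                  f"It failed with:\n{result.stderr}\n"
                  "Propose a corrected solution and verification script.")
    return None  # no candidate survived numerical checking
```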
A new rating system could help separate hype from real progress
To counter hype around AI-generated mathematics, the researchers propose a standardized rating system. Results would be classified along two axes: the degree of AI involvement (primarily human, collaborative, or essentially autonomous) and scientific significance (from "negligible" to "generational breakthrough").
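As a rough illustration, the two axes could be captured in a small data structure like the one below. Only the levels named in the article are included; the papers' full scales are not reproduced here:

```python
from dataclasses import dataclass
from enum import Enum

# Sketch of the proposed two-axis rating. Only the levels named in the
# article are listed; the papers' full scales may contain more steps.

class AIInvolvement(Enum):
    PRIMARILY_HUMAN = "primarily human"
    COLLABORATIVE = "collaborative"
    ESSENTIALLY_AUTONOMOUS = "essentially autonomous"

class Significance(Enum):
    NEGLIGIBLE = "negligible"
    MAJOR_ADVANCE = "major advance"
    LANDMARK_BREAKTHROUGH = "landmark breakthrough"
    GENERATIONAL_BREAKTHROUGH = "generational breakthrough"

@dataclass
class ResultRating:
    involvement: AIInvolvement
    significance: Significance

# Hypothetical example, not a rating taken from the papers:
example = ResultRating(AIInvolvement.COLLABORATIVE, Significance.NEGLIGIBLE)
```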
The researchers deliberately rate their own results modestly. The solved Erdős problems, despite their decades-long "open" status, are mathematically rather elementary. The autonomous paper on eigenweights is publishable but falls within the broad range of typical journal publications. The researchers explicitly do not claim results at the "major advance" or "landmark breakthrough" level.
They also propose "Human-AI Interaction Cards" that document which prompts and AI outputs led to key findings. Terence Tao, one of the world's most prominent mathematicians, has already set up a community wiki to publicly track AI-assisted progress on Erdős problems.
Broad knowledge helps, but confidence in wrong answers remains a problem
The researchers emphasize that AI currently cannot reliably solve research-level mathematics. The successes so far stem more from the model's enormous breadth of knowledge and clever technical workarounds than from genuine mathematical creativity. Errors are often presented with high confidence, which makes collaboration challenging.
The second paper also warns of a potential peer review crisis: if AI massively accelerates the production of technically complex research papers, the bottleneck in science shifts from generating ideas to verifying them. Traditional review processes aren't built for that.
Still, the authors of both papers see Gemini Deep Think as a "force multiplier" for human research. The model can handle knowledge retrieval and routine verification, freeing researchers to focus on the actual thinking. Whether this division of labor works in practice, though, depends on how well humans can verify the AI's output.
Deepmind isn't alone in this view. Kevin Weil, head of the science team at competitor OpenAI, expects AI use in science to become as routine this year as it already is in software engineering. By 2028, his company aims to build an autonomous research agent.