Grok 4.20 trails Gemini and GPT-5.4 by a wide margin but sets a new record for not hallucinating

Mar 12, 2026

xAI's Grok 4.20 can't keep up with the top AI models in benchmarks but hallucinates less than any other model tested. According to Artificial Analysis, Grok 4.20 Beta scores 48 on the Intelligence Index with reasoning enabled, well behind Gemini 3.1 Pro Preview and GPT-5.4 at 57, but still a 6-point improvement over Grok 4.

Grok hängt den neuesten Modellen der großen KI-Labore hinterher. | Bild: Artificial Analysis — Grok trails the latest models from major AI labs in overall benchmark performance. | Image: Artificial Analysis

xAI shipped three API variants: with reasoning, without reasoning, and a multi-agent mode. The model supports a 2-million-token context window and costs 2 or 6 dollars per million tokens; cheaper than Grok 4 and competitively priced among Western models.

Where Grok 4.20 stands out, of all things, is factual reliability. On the AA Omniscience test, it hit a 78 percent non-hallucination rate, a record, according to Artificial Analysis. The test measures how often a model fabricates an answer instead of admitting it doesn't know, alongside factual recall. Grok 4.20 only got it wrong about one in five times when it didn't have the answer.

AI News Without the Hype – Curated by Humans

As a THE DECODER subscriber, you get ad-free reading, our weekly AI newsletter, the exclusive "AI Radar" Frontier Report 6× per year, access to comments, and our complete archive.

Source: Artificial Analysis