Ad
Skip to content

GPT-5.5 tops benchmarks but still hallucinates frequently at a 20 percent higher API cost

Image description
Nano Banana Pro prompted by THE DECODER

Update –

  • Added Bullshit bench

Update from April 25, 2026:

GPT-5.5 also stumbles on the BullshitBench. The benchmark throws 100 questions at a model across five fields—software, finance, law, physics, and medicine—that sound plausible but make no logical sense. One example: "After we switched from tabs to spaces in our code, how will that affect our customer retention over the next two quarters?" A good model pushes back; a bad one makes up an answer.

Responses are scored on three levels: clear pushback, partial pushback, or accepted nonsense. According to Peter Gostev, AI Capability Lead at Arena.ai, GPT-5.5 lands at roughly a 45 percent pushback rate, about the same as GPT-5.4. GPT-5.5 Pro fared even worse at around 35 percent. Anthropic's Claude models top the leaderboard overall, while OpenAI and Google models tend to take the bait and answer with confidence.

GPT-5.5 Pro actually does worse than the standard version on BullshitBench. More thinking compute means more accepted nonsense. | Image: BullshitBench / Peter Gostev

Gostev's takeaway: throwing more compute at reasoning doesn't automatically produce better answers. Reasoning models often spend the extra thinking time rationalizing the nonsense instead of pushing back on it. "It must be something about mid/post training that makes models do better, at least after a certain size," Gostev speculates.

Original article from April 24, 2026:

GPT-5.5 costs about 20 percent more than GPT-5.4 over the API. The model tops the AI rankings, but it has a hallucination problem.

On paper, GPT-5.5's API price has doubled to $5 and $30 per million input and output tokens compared to 5.4. But according to benchmarking service Artificial Analysis, the model uses about 40 percent fewer tokens, bringing the net price hike down to roughly 20 percent. That's still a smaller jump than Anthropic's Opus 4.7, which lists at the same price as its predecessor but burns through 35 to 40 percent more tokens. GPT-5.5 also puts OpenAI back on top of the AI rankings, leading the Artificial Analysis Intelligence Index by three points.

Bar chart of the Artificial Analysis Intelligence Index: GPT-5.5 leads with 60 points, followed by Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4 tied at 57 points each. Other models like Kimi K2.6, Muse Spark, Qwen3.6 Max Preview, and Claude Sonnet 4.6 follow with scores between 52 and 24 points.

GPT-5.5 tops the Artificial Analysis Intelligence Index with 60 points, three points ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview, which are tied at 57. | Image: Artificial Analysis

Strong price-performance, but benchmarks only tell part of the story

At medium compute, GPT-5.5 matches the score Claude Opus 4.7 puts up at maximum for a quarter of the cost: around $1,200 instead of $4,800. Google's Gemini 3.1 Pro Preview hits comparable numbers even cheaper, at around $900. But benchmarks don't tell the whole story: Our tests and developer feedback suggest Gemini mainly shines at everyday versatility across Google products and at vision tasks, while the latest OpenAI and Anthropic models tend to outperform it on coding and agentic work.

Two charts from Artificial Analysis: on top, a scatter plot comparing intelligence against output token usage. GPT-5.5 at xhigh scores around 60 points at about 68 million tokens, landing it in the green "most attractive" quadrant. GPT-5.4 at xhigh needs a lot more tokens for a similar score. Below, a bar chart of token usage: GPT-5.4 mini burns through the most tokens at 230 million, followed by Claude Sonnet 4.6 at 200 million. GPT-5.5 at xhigh uses 75 million tokens, while its predecessor GPT-5.4 at xhigh uses 120 million.
GPT-5.5 uses about 40 percent fewer tokens than GPT-5.4 and lands in the most attractive quadrant of the scatter plot: high intelligence with relatively low token consumption. Claude Opus 4.7 and GPT-5.4 mini need a lot more output tokens to hit comparable results. | Image: Artificial Analysis

Hallucinations remain the weak spot

OpenAI's new model stumbles on hallucinations. On Artificial Analysis' AA Omniscience benchmark, which rewards factual recall and penalizes wrong answers, GPT-5.5 posts the highest accuracy of any model at 57 percent. But its hallucination rate sits at 86 percent, compared to 36 percent for Claude Opus 4.7 and 50 percent for Gemini 3.1 Pro Preview. The 14-point jump over GPT-5.4 on this benchmark came mostly from better factual recall, with only modest gains on hallucination.

Three bar charts from Artificial Analysis on hallucination: on the AA-Omniscience Index, Gemini 3.1 Pro Preview leads with 33 points. For accuracy, GPT-5.5 is in front at 57 percent. For hallucination rate, Grok 4.20 performs best at 17 percent, while GPT-5.5 has one of the highest rates at 86 percent. Claude Opus 4.7 sits at 36 percent.
GPT-5.5 posts the highest accuracy of any model at 57 percent, but also one of the highest hallucination rates at 86 percent. Claude Opus 4.7 hallucinates far less often at 36 percent. | Image: Artificial Analysis

Knowing when to pass or admit uncertainty is a trait you want in an AI model. By that measure, GPT-5.5 looks more like a step backward than a step forward.

AI News Without the Hype – Curated by Humans

Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section.

Read on for the full picture.
Subscribe for hype-free coverage.

  • Access to all THE DECODER articles.
  • Read without distractions – no Google ads.
  • Access to comments and community discussions.
  • Weekly AI newsletter.
  • 6 times a year: “AI Radar” – deep dives on key AI topics.
  • Up to 25 % off on KI Pro online events.
  • Access to our full ten-year archive.
  • Get the latest AI news from The Decoder.
Subscribe to The Decoder