
Just four weeks after releasing GPT-5.1, OpenAI is back with GPT-5.2 and some substantial benchmark improvements.

Whether ironic or sincere, OpenAI CEO Sam Altman commented on the GPT-5.2 release with a simple observation: "We've come a long way since GPT-5.1."

According to OpenAI, GPT-5.2 is "the most capable model series yet for professional knowledge work." The company is shipping three variants: GPT-5.2 Instant for quick everyday tasks, GPT-5.2 Thinking for more complex work, and GPT-5.2 Pro as a premium option for particularly demanding queries.

The new model matches Google's Gemini 3 Pro in many benchmarks and clearly outperforms it in others—notably the coding benchmark SWE-bench Verified and the abstract reasoning benchmark ARC-AGI-2. Google's strong showing with Gemini likely lit a fire under OpenAI to respond quickly.

First model to reach expert-level performance on knowledge work

In the GDPval benchmark, which tests knowledge work tasks across 44 professional fields and measures the economic relevance of AI models, GPT-5.2 Thinking hits 70.9 percent according to OpenAI—up from 38.8 percent for GPT-5 Thinking, a relative jump of more than 80 percent. OpenAI describes it as the first model to perform at or above human expert level on well-specified tasks.

OpenAI put numbers to the efficiency gains, claiming the model delivers results at more than eleven times the speed and at less than one percent of the cost of industry experts. In an internal benchmark for investment banking tasks like financial modeling, average scores improved from 59.1 to 68.4 percent, according to the company.

GPT-5.2 Thinking also sets new internal records for OpenAI in coding. In SWE-Bench Pro, which tests realistic programming tasks across four languages, the model scores 55.6 percent compared to 50.8 percent for GPT-5.1 Thinking. In SWE-bench Verified, the score climbs from 76.3 to 80 percent.

Fewer hallucinations and better long-context understanding

OpenAI claims to have cut response error rates by 30 percent. In a test using anonymized ChatGPT requests, the percentage of responses with at least one error dropped from 8.8 percent (GPT-5.1 Thinking) to 6.2 percent (GPT-5.2 Thinking). The test ran with maximum reasoning settings and search enabled; the results were graded by other AI models, which aren't infallible themselves.

OpenAI notes that error rates for specific assertions are significantly lower than for full answers, since a single response often contains multiple claims. Still, the company warns that GPT-5.2 Thinking isn't perfect, and users should verify its output.

On the long-context front, OpenAI says GPT-5.2 Thinking is the first model to hit nearly 100 percent accuracy on the 4-Needle MRCR test up to 256,000 tokens. That benchmark requires the model to find and cite hidden details within massive blocks of text. The improvement aims to help with heavy documentation tasks like analyzing contracts, research papers, or transcripts.

Image analysis also sees a boost, with OpenAI claiming it halved error rates. Accuracy on the CharXiv reasoning benchmark, which quizzes models on scientific diagrams, jumped from 80.3 to 88.7 percent. ScreenSpot-Pro scores, measuring UI understanding, improved from 64.2 to 86.3 percent.

For tool calling—the ability to use external apps or APIs—GPT-5.2 Thinking hit 98.7 percent in Tau2-bench-Telecom, up from 95.6 percent. The benchmark simulates complex customer service tickets where the AI has to juggle multiple tools.

Abstract reasoning sees massive gains

The most dramatic progress appears in the ARC-AGI-2 abstract reasoning benchmark. GPT-5.2 Thinking hit 52.9 percent, blowing past GPT-5.1 Thinking's 17.6 percent. That puts it well ahead of Gemini 3 Pro, which turned heads with a 31.1 percent score at launch just a few weeks ago.

GPT-5.2 Pro also cracked the 90 percent mark on the simpler ARC-AGI-1 test, reaching 90.5 percent. According to OpenAI, it achieves that performance at roughly one 390th of the cost of the o3-preview model from late 2024.

Still, both models trail Poetiq's AI system, which uses a hybrid approach combining open-source models with commercial engines like Gemini 3 and GPT-5.1.

| Benchmark | GPT-5.2 Thinking | GPT-5.1 Thinking | Gemini 3 Pro |
|---|---|---|---|
| GDPval (wins or ties) - Knowledge work tasks | 70.9% | 38.8% (GPT-5) | - |
| SWE-Bench Pro (public) - Software engineering | 55.6% | 50.8% | - |
| SWE-bench Verified - Software engineering | 80.0% | 76.3% | 76.2% |
| GPQA Diamond (no tools) - Science questions | 92.4% | 88.1% | 91.9% |
| CharXiv Reasoning (w/ Python) - Scientific figure questions | 88.7% | 80.3% | 81.4% |
| AIME 2025 (no tools) - Competition math | 100.0% | 94.0% | 95.0% |
| FrontierMath (Tier 1-3) - Advanced mathematics | 40.3% | 31.0% | - |
| FrontierMath (Tier 4) - Advanced mathematics | 14.6% | 12.5% | - |
| ARC-AGI-1 (Verified) - Abstract reasoning | 86.2% | 72.8% | - |
| ARC-AGI-2 (Verified) - Abstract reasoning | 52.9% | 17.6% | 31.1% |
| ScreenSpot-Pro - Screen understanding | 86.3% | 64.2% | 72.7% |
| Video-MMMU - Knowledge from videos | 85.9% | 82.9% | 87.6% |
| MMMLU - Multilingual Q&A | 89.6% | 89.5% | 91.8% |

Of course, benchmarks don't always tell the full story. GPT-5.1 Thinking was already holding its own, particularly on complex analysis and web search. If 5.2 delivers on these numbers, the "OpenAI is falling behind" narrative looks premature.

Performance comes at a premium

API pricing is ticking up. GPT-5.2 costs $1.75 per million input tokens and $14 per million output tokens—a bump from the $1.25/$10 rates for GPT-5.1. Cached inputs get a 90 percent discount. The Pro tier is significantly steeper at $21 for input and $168 for output per million tokens.

That puts OpenAI roughly in line with Google's Gemini 3 Pro, which charges $2/$12 per million tokens for standard context. Anthropic remains the priciest option: its comparable Opus 4.5 model costs $5 per million input tokens and $25 for output.

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-5.2 | $1.75 | $14 |
| GPT-5.2 Pro | $21 | $168 |
| GPT-5.1 | $1.25 | $10 |
| Gemini 3 Pro | $2 | $12 |
| Claude Opus 4.5 | $5 | $25 |
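For a quick sense of what these rates mean per request, the cost arithmetic can be sketched in a few lines of Python. The prices come from the figures above; the model keys and the example request sizes are made up for illustration, and the cached-input discount is applied as the flat 90 percent OpenAI describes:

```python
# Per-million-token prices (input, output) as reported in the article.
PRICES = {
    "gpt-5.2": (1.75, 14.00),
    "gpt-5.2-pro": (21.00, 168.00),
    "gpt-5.1": (1.25, 10.00),
    "gemini-3-pro": (2.00, 12.00),
    "claude-opus-4.5": (5.00, 25.00),
}

def request_cost(model, input_tokens, output_tokens, cached_input_tokens=0):
    """Dollar cost of one request; cached input billed at a 90% discount."""
    in_price, out_price = PRICES[model]
    uncached = input_tokens - cached_input_tokens
    cost = (uncached * in_price
            + cached_input_tokens * in_price * 0.1  # 90% off cached input
            + output_tokens * out_price) / 1_000_000
    return round(cost, 6)

# Hypothetical request: 50k input tokens (20k of them cached), 5k output tokens.
print(request_cost("gpt-5.2", 50_000, 5_000, cached_input_tokens=20_000))  # → 0.126
```

At these rates a mid-sized request on GPT-5.2 costs about 13 cents, versus roughly 10 cents on GPT-5.1—which is why OpenAI's token-efficiency argument matters for whether real-world costs actually rise.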

OpenAI argues the price hike reflects the performance boost, noting that improved token efficiency should actually lower total costs for complex tasks.

Ad
Ad

The new models are rolling out now to ChatGPT paid subscribers, including Plus, Pro, and Enterprise users. GPT-5.1 will stick around as a legacy option for three months. For developers, OpenAI says it has no immediate plans to deprecate GPT-5.1, GPT-5, or GPT-4.1 in the API.

Summary
  • OpenAI has launched GPT-5.2 in three versions—Instant, Thinking, and Pro—claiming it is the first AI model to achieve expert-level performance in knowledge work tasks.
  • The most notable progress is in abstract reasoning: GPT-5.2 Thinking scores 52.9 percent on the ARC-AGI-2 benchmark, a significant increase from the 17.6 percent achieved by GPT-5.1 Thinking.
  • Coding abilities have improved as well, with the SWE-bench Verified score rising from 76.3 to 80 percent, but API usage costs have also increased to $1.75 per million input tokens and $14 per million output tokens.
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.