Inception launches Mercury 2, the first diffusion-based language reasoning model

Feb 24, 2026

Mercury

Key Points

AI startup Inception Labs has introduced Mercury 2, a reasoning model that replaces the standard transformer architecture with a diffusion-based approach to generating text.
Instead of producing text token by token, Mercury 2 refines multiple text blocks at the same time, similar to an editor reworking an entire draft at once rather than focusing on one word at a time.
Running on Nvidia Blackwell GPUs, Mercury 2 reaches 1,009 tokens per second according to Inception Labs, making it more than five times faster than conventional models.

AI startup Inception Labs launches the first diffusion-based reasoning AI model. It's fast and cheap.

Unlike conventional language models, Mercury 2 refines multiple text blocks simultaneously instead of going through a text word for word. The startup compares this to an editor revising an entire draft at once rather than looking at individual words.

The most obvious benefit so far is speed and price. According to Inception, Mercury 2 hits 1,009 tokens per second on Nvidia Blackwell GPUs with an end-to-end latency of just 1.7 seconds, compared to 14.4 seconds for Gemini 3 Flash and 23.4 seconds for Claude Haiku 4.5 with reasoning enabled. Output quality is comparable to leading speed-optimized models, the company claims.

Pricing sits at $0.25 per million input tokens and $0.75 per million output tokens, undercutting Gemini 3 Flash ($0.50/$3.00) by half on input and four times on output, and roughly four times cheaper than Claude Haiku 4.5 ($1.00/$5.00) on input and more than two and a half times on output.

Model	E2E Latency (Seconds)	GPQA Diamond	LCB	SciCode	IFBench	AIME	TAU
Mercury 2	1.7	74	67	38	71	91	53
GPT-5 Nano (Minimal)	4.5	43	47	29	33	27	26
Claude 4.5 Haiku (Non-Reasoning)	5.0	65	51	34	42	39	33
Gemini 2.5 Flash-Lite (Reasoning)	7.8	71	59	29	53	69	31
Gemini 3 Flash (Reasoning)	14.4	90	91	51	78	78	80
Gemini 2.5 Flash (Reasoning)	15.6	79	69	39	50	57	32
GPT-5 Mini (Medium)	22.8	80	69	41	71	48	71
Claude 4.5 Haiku (Reasoning)	23.4	67	62	43	54	84	55

The model supports a 128K context window, tool usage, and JSON output. Inception is targeting companies with latency-sensitive applications like voice assistants, coding tools, and search systems.

Mercury 2 is available now through an OpenAI-compatible API. Companies can apply for early access, and the model can also be tested directly in the chat.

The AI industry is looking for what comes after the Transformer

Last November, Inception raised $50 million from investors including Microsoft, Nvidia, and Snowflake. The startup showed its first prototype in early 2025. With Mercury 2, Inception is now shipping a production-ready model with reasoning capabilities.

Google Deepmind is also working on diffusion-based language models. Gemini Diffusion performed on par with the then-current Gemini 2.0 Flash Lite model in benchmarks. However, Google hasn't said anything about the diffusion experiment since it was first presented in May 2025.

More broadly, a growing number of startups are exploring alternatives to the dominant Transformer architecture. Whether diffusion-based language models can hold up long-term remains an open question.

AI News Without the Hype – Curated by Humans

Subscribe to THE DECODER for ad-free reading, a weekly AI newsletter, our exclusive "AI Radar" frontier report six times a year, full archive access, and access to our comment section.

Source: Inception Labs