
Nvidia sets new MLPerf records with 288 GPUs while AMD and Intel focus on different battles

Nano Banana Pro prompted by THE DECODER

The latest round of the industry's top inference benchmark introduces multimodal and video models for the first time. Nvidia, AMD, and Intel each highlight different metrics, making direct comparisons difficult.

Benchmark organization MLCommons published the results of MLPerf Inference v6.0 on April 1, 2026. All three major chipmakers submitted results and claimed top spots. But the results are only partially comparable: Nvidia, AMD, and Intel use different system configurations, models, and scenarios, and each company frames its numbers to put its own strengths front and center.

Nvidia, for example, showcases its records primarily on DeepSeek-R1 and the new GPT-OSS-120B, sometimes using 288-GPU configurations. AMD compares itself to Nvidia's B200 and B300 in single-node setups with eight GPUs but didn't submit results for DeepSeek-R1 or the multimodal Qwen3-VL. Intel targets an entirely different market segment, competing with workstation GPUs. Anyone trying to make sense of these numbers needs to keep these differences in mind.

Notably absent are submissions from Google for its latest Ironwood-generation TPU chips or inference specialists like Cerebras.

Five new benchmarks significantly expand the test suite

MLPerf Inference v6.0 adds several new tests: an interactive scenario for DeepSeek-R1 with a five-times-higher minimum token rate, the vision-language model Qwen3-VL-235B as the suite's first multimodal model, OpenAI's GPT-OSS-120B, the text-to-video model WAN-2.2-T2V, and the transformer-based recommendation benchmark DLRMv3. Only Nvidia submitted results for all new models and scenarios.

Software optimizations alone double Nvidia's throughput on the same hardware

According to Nvidia, the GB300-NVL72 system with Blackwell Ultra GPUs achieved the highest throughput across all new workloads. The company highlights a 2.7x performance jump on DeepSeek-R1 in the server scenario compared to its first submission six months ago, achieved on the same hardware through software optimizations alone; Nvidia partner Nebius delivered the result. Nvidia says this cuts token production costs by more than 60 percent.

These gains came from a series of software-level tweaks. Basic compute operations were sped up and fused together so GPUs spend less time on overhead. The open-source framework Nvidia Dynamo separates the two phases of text generation (processing the input and generating new tokens) and optimizes each independently.

For models like DeepSeek-R1 that only activate a subset of their parameters per request, Wide Expert Parallel distributes expert weights across more GPUs so no single card becomes a bottleneck. When batch sizes are small in interactive scenarios and compute power sits idle, Multi-Token Prediction generates multiple tokens at once instead of just one. Even on the older Llama 3.1 405B, server performance improved by 1.5x, according to Nvidia.
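To make the multi-token prediction argument concrete, here is a toy throughput model, not Nvidia's implementation. All numbers (step latency, draft length, acceptance rate) are hypothetical assumptions chosen only to illustrate why emitting several tokens per forward pass helps when batches are small:

```python
def tokens_per_second(step_latency_s: float, tokens_per_step: float) -> float:
    """Decode-loop throughput: average accepted tokens per step / step time."""
    return tokens_per_step / step_latency_s

# Baseline: one token per forward pass at a hypothetical 20 ms step latency.
baseline = tokens_per_second(0.020, 1.0)

# Multi-token prediction: draft 4 tokens per step, assume 70% are accepted
# on average, at a slightly higher hypothetical step latency of 24 ms.
mtp = tokens_per_second(0.024, 4 * 0.70)

speedup = mtp / baseline  # well over 2x in this toy setting
```

The point of the sketch is that at small batch sizes the GPU is latency-bound, so trading a slightly slower step for several tokens per step raises effective throughput.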

In the largest configuration ever submitted to MLPerf Inference, Nvidia connected four GB300-NVL72 systems with a total of 288 GPUs over Quantum-X800 InfiniBand. The result: roughly 2.49 million tokens per second on DeepSeek-R1 in the offline scenario. Fourteen partners submitted results on the Nvidia platform, the most of any platform in this round. Nvidia puts its cumulative MLPerf wins since 2018 at 291 - nine times more than all other submitters combined.
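As a back-of-the-envelope check on the figures above (the 2.49 million number is itself approximate), the aggregate result works out to roughly 8,600 tokens per second per GPU:

```python
# Figures as reported for the 288-GPU DeepSeek-R1 offline result.
total_tokens_per_s = 2_490_000
gpus = 288

per_gpu = total_tokens_per_s / gpus  # roughly 8,600+ tokens/s per GPU
```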

AMD closes the single-node gap and crosses one million tokens per second

According to AMD's blog post, the Instinct MI355X on CDNA 4 architecture with 3 nm manufacturing and up to 288 GB HBM3E crossed the one-million-tokens-per-second mark in MLPerf for the first time - though with multi-node scaling using up to 94 GPUs on Llama 2 70B and GPT-OSS-120B. Compared to the previous-generation MI325X, AMD says the MI355X delivers a 3.1x throughput jump on the Llama 2 70B server benchmark.

The most direct comparison comes in single-node setups with eight GPUs each. AMD says the MI355X matched Nvidia's B200 on Llama 2 70B in the offline scenario, hit 97 percent in the server scenario, and reached 119 percent of B200 performance in the interactive scenario. Against the newer B300, those numbers came in at 92, 93, and 104 percent, respectively. On GPT-OSS-120B, AMD says the MI355X beat the B200 by 11 and 15 percent in offline and server mode, but trailed the B300 at 91 and 82 percent.

Two important caveats apply here: AMD didn't submit results for the significantly larger DeepSeek-R1 with its MoE architecture, which is exactly where Nvidia posts its strongest numbers. And AMD's submission for the text-to-video model Wan-2.2 landed in the Open division rather than the Closed division, which formally limits direct comparability. AMD also cites post-deadline results that reportedly reached 108 percent of B200 performance, but notes these numbers weren't verified by MLCommons.

Multi-node scaling across 11 nodes achieved 93 to 98 percent efficiency, according to AMD. Also noteworthy is the first-ever heterogeneous MLPerf submission: Dell and MangoBoost combined MI300X, MI325X, and MI355X GPUs across sites in the US and Korea, hitting roughly 142,000 tokens per second on Llama 2 70B in server mode. Nine partners submitted results on AMD hardware, with scores within four percent of AMD's own measurements.
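Scaling efficiency here means achieved multi-node throughput divided by ideal linear scaling of a single node. A minimal sketch, using hypothetical throughput numbers (AMD reports only the efficiency range, not the underlying figures):

```python
def scaling_efficiency(multi_node_tps: float, nodes: int,
                       single_node_tps: float) -> float:
    """Achieved throughput as a fraction of ideal linear scaling."""
    return multi_node_tps / (nodes * single_node_tps)

# Hypothetical: a node doing 100,000 tokens/s would ideally hit
# 1,100,000 tokens/s on 11 nodes; 1,045,000 achieved would be 95%.
eff = scaling_efficiency(1_045_000, 11, 100_000)
```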

Intel skips the data center fight, targets workstations and edge instead

Intel takes a fundamentally different approach. Rather than competing with Nvidia and AMD in the data center, Intel showcases its Arc Pro B70 and B65 GPUs alongside Xeon 6 processors as an inference platform for workstations and edge systems. A system with four Arc Pro B70 cards provides 128 GB of VRAM and can run 120-billion-parameter models with high parallelism, according to Intel. The Arc Pro B70 delivers up to 1.8x the inference performance of the Arc Pro B60.

Software optimizations on the same B60 hardware reportedly brought up to 1.18x performance improvements over MLPerf v5.1. Intel emphasizes that it's the only server processor maker submitting standalone CPU results for MLPerf Inference. More than half of all submissions in MLPerf 6.0 use Xeon as the host CPU.

Why these results don't produce a simple ranking

The results show that while MLPerf Inference remains the most important industry standard for AI inference benchmarks, it doesn't produce a straightforward leaderboard. Nvidia has consistently demonstrated the broadest coverage of new benchmarks and the highest absolute throughput numbers at scale for many years. In single-node setups, however, AMD claims comparable or higher scores than Nvidia's B200 in several scenarios while covering fewer benchmarks. Intel serves a different market entirely.

On top of that, each chipmaker naturally highlights the scenarios and configurations where its own products perform best. AMD's percentage comparisons against Nvidia's B200 and B300 represent the most transparent head-to-head data available, but they only apply to the models and scenarios AMD actually submitted. Nvidia's scaling results with 288 GPUs have no AMD counterpart. And Nvidia's 2.7x software improvement and AMD's 3.1x generational leap measure fundamentally different things: pure software optimization on the same hardware versus a new chip architecture.

Nvidia pushes for a new benchmark that measures real-world API performance

A step toward better comparability could come with the upcoming MLPerf Endpoints benchmark. Nvidia says in its blog post that it is driving the definition of this benchmark within the MLCommons consortium. The reasoning: current tests measure the throughput of individual chips and systems under standardized conditions but don't capture how an inference service performs under real API traffic. With the rise of agentic AI systems that demand especially fast token rates, Nvidia argues the need for measurement methods that go beyond pure chip benchmarks is growing. This naturally plays to Nvidia's strengths: the company recently unveiled Vera Rubin, a system designed specifically for these workloads.

According to Nvidia, MLPerf Endpoints would give the community a verifiable picture of how deployed services actually perform under realistic load. The goal is to capture metrics that hardware benchmarks alone can't reveal, such as latency variability, throughput under concurrent requests, and overall infrastructure efficiency.

AMD, meanwhile, points to its planned MI400 series on CDNA 5 architecture and the Helios rack-scale solution for 2026. The competition for the most efficient AI inference is set to intensify further.
