Nvidia has presented the first benchmark results for its new Blackwell platform in AI training as part of MLPerf Training v4.1. According to the results, performance has more than doubled compared to the previous generation in some workloads.
In the MLPerf Training v4.1 benchmarks, the Nvidia Blackwell platform delivered 2.2 times the per-GPU performance of Hopper in the Llama 2 70B LLM fine-tuning benchmark and 2 times the performance in GPT-3 175B pre-training. Nvidia submitted results for all benchmarks in the suite, including Stable Diffusion v2 training, where the new generation outperformed the old by a factor of 1.7.
The previous Hopper generation continues to improve as well: compared with the last MLPerf Training round, Hopper delivered 1.3 times the performance in language model pre-training. Nvidia also set a new scaling record, submitting a run with 11,616 Hopper GPUs for the GPT-3 175B benchmark.
Blackwell optimizes Tensor Cores and high-bandwidth memory
According to Nvidia, the Blackwell architecture uses new kernels that make more efficient use of the Tensor Cores, which is also why performance per watt is said to be better than Hopper's despite the higher power draw. The company did not provide exact power-consumption figures.
Blackwell's higher per-GPU compute throughput and its larger, faster high-bandwidth memory also make it possible to run the GPT-3 175B benchmark on just 64 GPUs; with Hopper, 256 GPUs were needed to reach the same performance.
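A rough back-of-envelope calculation illustrates why memory capacity helps set the minimum GPU count. The sketch below is not part of Nvidia's submission: it assumes the common rule of thumb of roughly 16 bytes of training state per parameter under mixed-precision Adam, and uses the publicly listed HBM capacities of 80 GB for the H100 and 192 GB for Blackwell's B200.

```python
# Back-of-envelope estimate: per-GPU training state for a 175B-parameter
# model, assuming ~16 bytes/param (FP16 weights and gradients plus FP32
# master weights and two Adam optimizer moments). Illustrative only.

PARAMS = 175e9
BYTES_PER_PARAM = 16                      # rough mixed-precision Adam footprint
HBM_GB = {"H100 (80 GB)": 80, "B200 (192 GB)": 192}  # public spec-sheet sizes

total_state_gb = PARAMS * BYTES_PER_PARAM / 1e9      # ~2,800 GB of model state

for gpu, capacity in HBM_GB.items():
    for n_gpus in (64, 256):
        per_gpu = total_state_gb / n_gpus
        headroom = capacity - per_gpu
        print(f"{gpu}, {n_gpus:3d} GPUs: {per_gpu:6.1f} GB state/GPU, "
              f"{headroom:6.1f} GB left for activations and overhead")
```

Under these assumptions, 64 GPUs each hold about 44 GB of model state; on an 80 GB Hopper card that leaves little room for activations and communication buffers, while Blackwell's 192 GB leaves roughly 150 GB of headroom.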
Nvidia promises further performance gains and a stronger focus on inference
In the presentation, Nvidia also highlighted the performance leaps the Hopper generation has made through software and networking updates across recent MLPerf rounds. Since this is the first Blackwell submission, the company expects similar improvements in future submissions. Blackwell Ultra, the next AI accelerator, is scheduled to hit the market next year and is expected to offer more memory and more computing power.
Blackwell made its debut in the MLPerf Inference v4.1 benchmark just last September. There, Nvidia's AI accelerator delivered up to four times the per-GPU performance of the H100 on Llama 2 70B, partly through the use of lower FP4 precision, which, according to Nvidia, has no impact on result quality. The company also sees a new trend toward scaling inference-time compute, as low-latency chatbots and AI models that 'think', such as OpenAI's o1 model, continue to drive demand in that sector.
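To see why dropping to 4-bit precision matters for inference, the following minimal sketch shows symmetric 4-bit weight quantization in NumPy. This is a deliberate simplification for illustration, not Nvidia's actual FP4 (E2M1) floating-point format or its kernel implementation; the layer size and seed are arbitrary.

```python
import numpy as np

# Minimal sketch: symmetric 4-bit weight quantization (a simplification,
# not Nvidia's FP4/E2M1 format). 4-bit weights need a quarter of the
# memory and bandwidth per parameter compared with FP16.

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)  # toy layer

scale = np.abs(w).max() / 7.0         # map weights onto signed codes -7..7
q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)  # 4-bit codes in int8
w_hat = q.astype(np.float32) * scale  # dequantize to measure the error

rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"memory vs FP16: {4 / 16:.0%}")            # 25% of the FP16 footprint
print(f"relative weight error: {rel_err:.3%}")    # small but nonzero
```

The quarter-sized weights cut memory traffic per token, which is where much of the inference speedup comes from; whether the small quantization error is acceptable depends on the model and task, which is why Nvidia's "no impact" claim refers specifically to its benchmark results.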