The latest round of the MLPerf Inference benchmark is dominated by Nvidia's H200 GPU, while the competition barely shows up. Instead of comparing the available AI chips against one another, the benchmark mostly shows Nvidia competing against itself.
In the latest MLPerf Inference benchmarks, Nvidia leads with its Hopper GPUs, especially the H200. With 141 GB of HBM3e, the H200 offers 76% more memory and 43% more bandwidth than the H100. For the first time, the suite also includes tests with the large Llama 2 70B language model and the Stable Diffusion XL image model.
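Those percentages follow directly from the published specifications, assuming the commonly cited 80 GB and 3.35 TB/s for the H100 SXM versus 141 GB and 4.8 TB/s for the H200:

```python
# Published specs: H100 SXM (80 GB HBM3, 3.35 TB/s), H200 (141 GB HBM3e, 4.8 TB/s).
mem_gain = 141 / 80 - 1    # -> 0.7625, i.e. +76% memory
bw_gain = 4.8 / 3.35 - 1   # -> 0.4328, i.e. +43% bandwidth
print(f"+{mem_gain:.0%} memory, +{bw_gain:.0%} bandwidth")
```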
The memory-enhanced H200 GPUs generated up to 31,000 tokens per second in their MLPerf debut with TensorRT-LLM, a record for the MLPerf Llama 2 70B benchmark. Even if Nvidia's GPUs delivered only a fraction of this throughput in practice, they would still be at Groq's speed level.
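For context, an offline throughput figure like this is essentially the total number of generated tokens divided by wall-clock time. A minimal sketch of such a measurement follows; generate() is a hypothetical stand-in for any batched inference backend (such as a TensorRT-LLM engine), not the actual MLPerf harness:

```python
import time

def tokens_per_second(generate, prompts):
    # generate() returns one list of output token IDs per prompt.
    # Throughput = total generated tokens / elapsed wall-clock time.
    start = time.perf_counter()
    outputs = generate(prompts)
    elapsed = time.perf_counter() - start
    return sum(len(tokens) for tokens in outputs) / elapsed

# Dummy backend for illustration: pretends to emit 128 tokens per prompt.
dummy = lambda prompts: [[0] * 128 for _ in prompts]
print(f"{tokens_per_second(dummy, ['hi'] * 32):,.0f} tokens/s")
```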
In the "Open Division", Nvidia also demonstrated three techniques for speeding up inference: Structured Sparsity, Pruning, and DeepCache. They are said to increase efficiency by up to 74%.
Nvidia competes with Nvidia, Intel also joins in
Nvidia was the only vendor to submit results in every test. Intel participated with Gaudi2 and CPU results; Google contributed only a TPU v5e result. Gaudi2 did not match Nvidia's performance, but according to Intel it offers a better price-performance ratio, an advantage Intel will likely try to press with the next-generation Gaudi3. Gaudi3 itself was completely absent, however, as were AMD's MI300X and Cerebras' solution. Qualcomm's Cloud AI cards made an appearance but were underwhelming.
To summarize: MLPerf is increasingly becoming an Nvidia benchmark in which the company competes against itself. The other vendors are holding back, yet still seem to find customers for their AI accelerators. The situation may change next year, when Nvidia releases its new Blackwell generation and the new chips from AMD and Intel are in use.