Nvidia is working to resolve thermal issues in its high-density server racks containing 72 of its new Blackwell AI chips, according to The Information.
Sources close to the company say the engineering team has made multiple revisions to the rack designs to address the heat problems. The racks are particularly difficult to engineer because of their high chip density and liquid-cooling systems.
While some customers have received early units, the overheating issues persist as Nvidia scales up production. Cloud service providers are growing concerned about potential delays in deploying their GPU infrastructure.
Some are considering purchasing additional Hopper-generation chips as a backup plan. This could lift Nvidia's near-term sales but slow revenue growth later if Blackwell deployments are delayed.
Nvidia maintains that adjustments during data center integration are standard practice. "The engineering iterations are normal and expected," a spokesperson for Nvidia told The Information.
Nvidia CEO Jensen Huang told attendees at a Goldman Sachs technology conference in San Francisco that customers have become "more emotional" about receiving Nvidia components, as delivery times now directly affect their competitive position and revenue.
Early Blackwell Benchmarks
Early performance tests suggest that Blackwell could deliver up to twice the AI training speed of the previous Hopper generation, despite the current thermal management challenges. The company expects further performance gains through software and networking updates.
With the industry trending toward inference-time scaling, Nvidia is also putting more focus on inference performance. In the MLPerf Inference v4.1 benchmark results from last September, Blackwell delivered up to four times the per-GPU performance of the H100 on the Llama 2 70B workload.
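For context, inference-time scaling means spending extra compute at serving time to get better answers, for example by sampling several candidate responses and keeping the best one, rather than relying only on a larger model. The sketch below is a minimal, illustrative best-of-n loop; generate() and score() are hypothetical stand-ins for a real model call and a verifier or reward model, not any actual Nvidia or MLPerf code.

    # Minimal sketch of inference-time scaling via best-of-n sampling.
    # generate() and score() are toy stand-ins: in practice they would
    # call a deployed model and a verifier/reward model, respectively.
    import random

    def generate(prompt: str) -> str:
        # Toy stand-in for one sampled completion; tags the candidate
        # with a random latent quality score for demonstration.
        quality = random.random()
        return f"candidate(q={quality:.3f})"

    def score(candidate: str) -> float:
        # Toy verifier: recover the latent quality tag from the text.
        return float(candidate.split("q=")[1].rstrip(")"))

    def best_of_n(prompt: str, n: int) -> str:
        # Spending more inference compute (n samples) tends to yield
        # a higher-scoring answer -- the essence of the scaling trend.
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=score)

    for n in (1, 4, 16):
        print(n, best_of_n("example prompt", n))

Running the loop shows the selected candidate's score generally improving as n grows, which is why per-GPU inference throughput, the figure highlighted in the MLPerf results above, matters more as this pattern spreads.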