Speed, supply chains, and strategy converge in Nvidia's $20 billion quasi-acquisition of Groq
Nvidia is paying a reported $20 billion for Groq's chip technology and top engineers. The deal addresses memory costs, inference competition, and the rise of AI agents all at once.
At first glance, the Groq deal looks like an expensive purchase of technology Nvidia could build itself. Some reports put the price around $20 billion, nearly three times the $6.9 billion valuation Groq reached after its September funding round. Neither company has confirmed the numbers or released any financial details.
Looking at the bigger picture, though, Nvidia appears to be tackling several structural problems at once with this quasi-acquisition. Here are the likely reasons, though several may overlap.
Specialized chips beat one-size-fits-all
Nvidia defines an AI factory as infrastructure covering the entire AI lifecycle: data acquisition, training, and inference. The company's Enterprise AI Factory Design Guide emphasizes latency and throughput requirements for real-time inference and complex agent interactions.
Not every AI task needs the same hardware. In an internal email first reported by CNBC, Jensen Huang wrote that Nvidia will integrate Groq's low-latency processors into its AI factory architecture to route different workloads to the right chips.
GPUs with lots of HBM memory remain the workhorse for training and bulk processing. Groq's SRAM architecture will handle real-time applications like voice agents or autonomous systems. This means Nvidia won't need expensive HBM GPUs for every latency-critical task.
Bank of America sees it similarly, calling the deal "surprising, strategic, expensive, offensive, defensive, complementary" all at once. The analysts argue that Nvidia recognizes the rapid shift from training to inference may require more specialized chips. The chipmaker could also use its platform dominance to neutralize competitive threats from other specialty chipmakers.
Memory prices are part of the picture
According to TrendForce, Samsung and SK hynix have raised HBM3e delivery prices for 2026 by nearly 20 percent. Samsung reportedly hiked prices on some memory chips by up to 60 percent in November 2025 compared to September. DDR5 spot prices have surged 307 percent since early September 2025.
Reuters reported in October that SK hynix had already sold out its entire 2026 production. Another Reuters report notes that HBM4 adds a customized "base die," the bottom layer of the memory stack that is increasingly tailored to individual customers. This makes switching to competitor products harder and adds even more pressure to the supply chain.
Nvidia already flagged this risk in its FY2025 Form 10-K from January 2025: "To secure future supply and capacity, we have paid premiums, provided deposits, and entered into long-term supply agreements and capacity commitments, which have increased our product costs and this may continue." According to Reuters, Jensen Huang confirmed the price increases but stressed that Nvidia had locked in significant volumes.
SRAM-first architecture cuts HBM dependence
SRAM is very fast memory built directly on the chip. HBM is also fast but sits outside the compute die and is part of a broader supply chain with the bottlenecks described above.
Groq's LPU architecture uses on-chip SRAM as the primary weight memory for models, not just as a cache. This cuts dependence on external HBM but limits model size per chip. Large models must be spread across many chips.
The trade-off makes sense for latency-sensitive tasks. Investor Gavin Baker argued on X that inference splits into a prefill phase (processing the prompt) and a decode phase (generating output token by token). SRAM architectures have an edge in decode, where each new token requires reading the active weights again, so fast memory access matters more than total capacity. This would give Nvidia an inference path optimized for low latency.
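A rough back-of-the-envelope sketch illustrates Baker's point. The numbers below are assumptions chosen for illustration, not figures from Nvidia, Groq, or Baker; the idea is that in decode every generated token has to re-read the active weights, so each byte of weight traffic supports only a couple of floating-point operations, while prefill amortizes the same weights over thousands of prompt tokens.

```python
# Illustrative sketch, not vendor data: FLOPs per byte of weight traffic
# in prefill vs. decode for a hypothetical model in FP8 (1 byte/weight).

def arithmetic_intensity(active_params: float, tokens_per_pass: int) -> float:
    """Approximate FLOPs per byte of weight reads for one forward pass,
    assuming ~2 FLOPs per parameter per token and weights read once."""
    flops = 2 * active_params * tokens_per_pass
    weight_bytes = active_params  # FP8: one byte per active parameter
    return flops / weight_bytes

ACTIVE_PARAMS = 7e9  # hypothetical 7B active parameters

# Prefill: a 2,048-token prompt is processed in one batched pass.
print("prefill:", arithmetic_intensity(ACTIVE_PARAMS, 2048), "FLOPs/byte")

# Decode: tokens come out one at a time, so every weight byte backs
# only ~2 FLOPs -- memory speed, not raw compute, sets the pace.
print("decode: ", arithmetic_intensity(ACTIVE_PARAMS, 1), "FLOPs/byte")
```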
SRAM could speed up Mixture of Experts models
Modern AI models like Deepseek V3 use a Mixture of Experts (MoE) architecture: only a subset of experts is active for each token. In Deepseek V3, that's 37 billion of 671 billion parameters.
Semiconductor analyst Zephyr writes on X that MoE models typically have shared experts and some dense layers that are active for every inference. It makes sense to keep those weights in SRAM while rarely activated experts stay in HBM.
For Deepseek V3, Zephyr calculates that the always-active components in FP8 come to just under 3.6 gigabytes. For those weights to actually sit in SRAM, Nvidia would need to deliberately size the on-chip memory for them, or spread the permanently active core across multiple chips so it stays locally available. Zephyr estimates the throughput advantage at 6 to 10 percent. That doesn't sound like much, but with hardware spending of $300 billion per year, it's real money, he adds.
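A minimal sketch of that arithmetic, taking the roughly 3.6 gigabyte figure from Zephyr's post and an assumed per-chip SRAM capacity (Groq's first-generation LPU is commonly cited at around 230 megabytes; whatever Nvidia builds may differ):

```python
# Back-of-the-envelope check: how many SRAM-only chips would it take to
# hold the always-active MoE weights? All inputs are assumptions except
# the ~3.6 GB estimate quoted above.

FP8_BYTES = 1  # one byte per weight in FP8

always_active_params = 3.6e9   # shared experts + dense layers (Zephyr's estimate)
sram_per_chip_bytes = 230e6    # assumed on-chip SRAM per accelerator

always_active_bytes = always_active_params * FP8_BYTES
chips_needed = always_active_bytes / sram_per_chip_bytes

print(f"always-active weights: {always_active_bytes / 1e9:.1f} GB")
print(f"chips needed to keep them entirely in SRAM: {chips_needed:.0f}")
```

That order of magnitude is why Zephyr frames the choice as either building chips with deliberately larger SRAM or sharding the always-active core across several of them.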
Baker sees the deal as part of a larger chip strategy: Nvidia could offer several Rubin variants in the future. One for high memory capacity during prefill, one as a balanced solution for training and batched inference, and a third for low latency during decode with more SRAM. Baker predicts most custom chips will eventually get canceled, with exceptions like Google's TPU, Tesla's AI chips, and Amazon's Trainium.
Small models and fast chips
The Groq deal fits into a broader Nvidia strategy. In August 2025, Nvidia researchers published a paper pushing for more use of small language models with fewer than 10 billion parameters in AI agents. Such models can handle 40 to 70 percent of typical agent queries at 10 to 30 times lower cost than large models.
A model with 7 billion parameters in FP8 precision needs around 7 gigabytes for weights, plus additional memory for runtime data like KV cache. Models with 70 billion parameters, by contrast, require distribution across many more chips. Groq's SRAM-first architecture is a natural fit for this kind of workload: agent systems with many short queries where low latency matters more than running the largest possible model.
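A quick sizing sketch makes that contrast concrete. The model dimensions below are hypothetical, loosely patterned on a typical 7-billion-parameter transformer with grouped-query attention; the context length and batch size are assumptions about an agent-style workload, not numbers from the Nvidia paper.

```python
# Hypothetical sizing of a small agent model in FP8: weights plus KV cache.

FP8 = 1  # bytes per value

params = 7e9
weight_gb = params * FP8 / 1e9
print(f"weights: {weight_gb:.1f} GB")

# KV cache: two tensors (K and V) per layer, stored for every token kept
# in context, across every request in the batch.
layers, kv_heads, head_dim = 32, 8, 128   # assumed grouped-query attention
context_tokens, batch = 4096, 8           # assumed agent workload
kv_bytes = 2 * layers * kv_heads * head_dim * FP8 * context_tokens * batch
print(f"KV cache: {kv_bytes / 1e9:.2f} GB")
```

Even with runtime data included, a model of this size stays within single-digit gigabytes, which is exactly the regime where spreading weights across a handful of SRAM-heavy chips is practical.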
Predictable latency beats peak throughput
GPUs distribute computing tasks dynamically at runtime. This maximizes throughput but can cause unpredictable delays. Individual requests sometimes take significantly longer than average.
Groq's LPU works differently: the entire chip operates like an orchestra keeping time. All parts execute the same instruction simultaneously, just on different data. The compiler plans all calculations in advance.
According to Groq's technical blog, this "static scheduling" enables constant response times regardless of how many requests are coming in. For language agents or real-time decisions, this can matter more than maximum throughput.
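A toy simulation can make that difference tangible. The latency figures below are invented and the dynamic case is a deliberate simplification (runtime contention modeled as random jitter), so this is only a sketch of the statistical effect, not a model of either architecture.

```python
# Toy comparison: variable per-request latency vs. a fixed, precomputed
# schedule. All numbers are invented for illustration.

import random
import statistics

random.seed(0)
N = 10_000

# "Dynamic" model: per-request time fluctuates with batching and contention.
dynamic_ms = [10 + random.expovariate(1 / 5) for _ in range(N)]

# "Static schedule" model: the compiler fixes the cycle count up front,
# so every request takes the same time.
static_ms = [12.0] * N

def p99(samples):
    return sorted(samples)[int(0.99 * len(samples))]

print(f"dynamic: mean {statistics.mean(dynamic_ms):.1f} ms, p99 {p99(dynamic_ms):.1f} ms")
print(f"static:  mean {statistics.mean(static_ms):.1f} ms, p99 {p99(static_ms):.1f} ms")
```

The averages land in the same range, but the dynamic model's 99th-percentile latency is more than double its mean, while the static schedule's tail is identical to its mean. That gap is what static scheduling is designed to eliminate.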
The deal blocks competitors in inference
Nvidia dominates training but faces stronger competition in inference from AMD and startups like Groq and Cerebras. Both startups have announced projects in the Middle East.
If Groq had continued scaling independently, the startup might have become a go-to option for latency-sensitive inference clusters, putting price pressure on Nvidia's business. Google could also have been interested in Groq to strengthen its TPU business.
Talent may be the real prize
According to Groq's announcement, Jonathan Ross, Sunny Madra, and other engineers are moving to Nvidia. Ross helped develop Google's TPU before founding Groq in 2016.
What makes this significant is that these engineers built the hardware, software, and compiler as a complete system. Nvidia now has a proven team that knows how to build an inference chip from the ground up. This matters because, unlike GPUs, Groq's static scheduling approach has no wiggle room at runtime. The compiler plans every cycle in advance and must know exactly how the hardware will behave.