Jensen Huang GTC Interview: Low-latency inference will become the next explosive growth engine of the AI economy, and the tight supply-demand balance for power and chips will persist for a long time

Wallstreetcn
2026.03.17 12:05

AI is moving from "generating information" to "executing tasks," and low-latency, high-throughput inference workloads, typified by coding agents, are opening the next important phase of AI infrastructure commercialization. On the supply side, power, chips, and data center construction almost all lack redundancy, and a tight balance may remain the industry backdrop for a long time.

After his keynote at GTC 2026, NVIDIA CEO Jensen Huang sat down for an exclusive interview with Ben Thompson, founder of Stratechery, laying out his views on core topics including the AI inference economy, NVIDIA's CPU strategy, the logic behind the Groq acquisition, and supply chain tensions.

In the interview, Huang argued that AI crossed a critical threshold in the past year: improved reasoning capabilities have allowed models to generate real economic value for the first time, and the explosion of coding agents is the clearest manifestation of this shift. NVIDIA has formally added ultra-high-speed, low-latency inference to its product portfolio.

On the supply side, Huang was candid that "almost every link is tight," and that neither power nor chip supply can easily be doubled. While NVIDIA says its supply chain is already planned through "this year and next," he hopes "land, power, and data centers" can come online faster, since they directly gate the pace of compute expansion and the conversion of capital expenditure into revenue.

Inference Economy: Low Latency Becomes the Next Monetization Engine

Huang attributes the core AI breakthrough of the past year to the maturing of reasoning capabilities. Generative AI was hard to commercialize early on because of hallucinations, he said, while reasoning lets models ground themselves through reflection, retrieval, and search, elevating them from providing information to actually completing tasks.

"Search is a service that no one pays for because the barrier to obtaining information is not high enough to make people spend money," Huang said. "We have now crossed that threshold—AI can not only converse with people but also do tasks for them."

Coding is the clearest example he cited. Code generation is not an ordinary language modality, he pointed out: it requires the model to reflect on and validate the execution results of code blocks. As this capability matures, engineers can shift their focus from writing code line by line to architecture and specification design.
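To make that loop concrete, here is a minimal, self-contained sketch of the generate-execute-verify cycle described above. Everything in it is illustrative: the model call is stubbed out, and a real coding agent would prompt an LLM and feed execution errors back in as context.

```python
import subprocess
import sys
import tempfile

def generate_code(task: str, feedback: str | None = None) -> str:
    # Stand-in for a model call: a real agent would prompt an LLM here,
    # passing the previous run's error output back in as `feedback`.
    return "print('hello from the agent')"

def run_and_check(source: str) -> tuple[bool, str]:
    # Execute the candidate in a subprocess and capture the result.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    proc = subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=30)
    return proc.returncode == 0, proc.stdout + proc.stderr

def coding_agent(task: str, max_rounds: int = 5) -> str | None:
    feedback = None
    for _ in range(max_rounds):
        candidate = generate_code(task, feedback)
        ok, output = run_and_check(candidate)
        if ok:
            return candidate  # execution validated this candidate
        feedback = output     # reflect on the error next round
    return None
```

The point of the structure is the one Huang makes: validation comes from actually running the code, not from the model's fluency.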

He revealed that NVIDIA's software engineers now use coding agents across the board: "Many of them haven't personally written a line of code in a while, yet their productivity is extremely high."

Based on this judgment, NVIDIA decided to bring low-latency inference into its product line. Huang explained that existing GPU systems face an inherent tension between maximizing total throughput and maximizing the token generation speed any single user sees, and that high-value coding-agent users are willing to pay a premium for a 10-fold increase in token speed.

"If Anthropic launches a Claude Code service layer that increases programming speed by 10 times, I would pay for it, no doubt. I am building this product for myself."

Acquiring Groq: A Strategic Move to Disaggregate the Inference Pipeline

NVIDIA's decision to acquire Groq is, in Huang's view, not a sudden move but a natural extension of years of investment in inference infrastructure.

He said NVIDIA began working out how to break the inference process into finer-grained pieces across heterogeneous infrastructure when it released the Dynamo inference-serving framework a year ago. The collaboration with Groq started about six months before the acquisition was announced, and the core of the deal is the Groq team and a technology license, not Groq's cloud service business.

At the technical level, NVIDIA will extend this disaggregation inside the decode stage itself: the Rubin GPU handles the FLOP-heavy attention computation, while Groq's LPU architecture takes on the portions that demand extremely high token rates and very low latency. Related products are planned for launch within the year.
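NVIDIA has not published implementation details, so the following is only a conceptual sketch of what such a split could look like: each decode step routes its FLOP-heavy attention stage to one backend and its latency-critical token stage to another. All class and method names here are hypothetical, not any real NVIDIA or Groq API.

```python
from dataclasses import dataclass

@dataclass
class DecodeStep:
    hidden_state: list[float]

class AttentionBackend:
    """FLOP-heavy attention; imagine a Rubin-class GPU behind this interface."""
    def attend(self, step: DecodeStep) -> DecodeStep:
        return step  # placeholder for the actual attention computation

class TokenBackend:
    """Latency-critical sampling; imagine an LPU-class part behind this interface."""
    def next_token(self, step: DecodeStep) -> int:
        return 0  # placeholder for output projection and sampling

def decode(prompt_state: DecodeStep, n_tokens: int) -> list[int]:
    gpu, lpu = AttentionBackend(), TokenBackend()
    tokens, state = [], prompt_state
    for _ in range(n_tokens):
        state = gpu.attend(state)             # FLOP-heavy stage -> GPU
        tokens.append(lpu.next_token(state))  # latency-bound stage -> LPU
    return tokens
```

The real engineering problem, presumably the job of a scheduler like Dynamo, is keeping the hand-off between the two backends fast enough that the split pays for itself.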

He said:

"But if your business is similar to Anthropic or OpenAI, Codex is generating real economic value, and you want to generate more tokens, then joining this accelerator can significantly boost revenue."

He also acknowledged that this solution is not suitable for all customers. For platforms primarily consisting of free users with low paid conversion rates, introducing Groq would increase costs and complexity, which is not cost-effective.

Huang compared the Groq deal to the earlier acquisition of Mellanox: both reflect NVIDIA's consistent logic of folding specialized external architectures into its computing stack for system-level co-optimization. "NVIDIA is an accelerated computing company, not a GPU company; we are not fixated on where the computation happens, we just want to accelerate applications."

CPU Strategy: Redefining Server Architecture for the AI Agent Era

Although NVIDIA has long been positioned as a GPU company, Huang laid out the logic behind its entry into the CPU market and explained the design philosophy of its in-house Vera CPU.

CPU design over the past decade, he pointed out, has been optimized for hyperscale cloud computing: the goal was to maximize the number of rentable cores, and single-thread performance was not a priority. In AI-agent workloads, however, the GPU waits on the results of tool calls, so the CPU's single-thread performance directly determines overall system efficiency. "You can never let the GPU sit idle," he said.
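A back-of-the-envelope calculation shows why. In a blocking agent loop, the GPU's share of each turn is gated by how quickly the CPU finishes the tool call; the numbers below are invented purely for illustration.

```python
def gpu_utilization(gpu_ms_per_turn: float, cpu_tool_ms: float) -> float:
    # In a blocking agent loop, the GPU is busy only for its share of the turn.
    return gpu_ms_per_turn / (gpu_ms_per_turn + cpu_tool_ms)

# Fix GPU work at 100 ms per turn and halve the tool-call latency each step.
for cpu_ms in (200.0, 100.0, 50.0):
    busy = gpu_utilization(100.0, cpu_ms)
    print(f"tool call {cpu_ms:5.0f} ms -> GPU busy {busy:.0%} of the turn")
```

Halving tool-call latency here lifts GPU utilization from 33% to 50% to 67%, which is why a faster single thread on the CPU translates directly into more work out of the far more expensive GPU.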

The Vera CPU's core differentiation lies in memory and I/O bandwidth: its per-core bandwidth is three times that of any current CPU, designed specifically so that I/O never becomes the bottleneck that slows the GPU down. He also pointed to the NVLink collaboration with Intel, which addresses the enterprise computing market's need for continuity in the x86 ecosystem.

Huang divided AI agents' tool use into two types: structured tools, including CLIs, APIs, and database queries; and unstructured tools, including PC applications that require the model to operate graphical interfaces through multimodal perception. NVIDIA is investing in both paths.
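For the structured path, a minimal illustration: the model is shown a typed schema, emits a call against it, and the runtime executes the call on its behalf. The schema format and the `run_sql` helper below are illustrative, not any specific agent framework's API.

```python
import json
import sqlite3

# A typed description the model can see; formats vary across frameworks.
query_tool = {
    "name": "run_sql",
    "description": "Run a read-only SQL query against the local database.",
    "parameters": {"query": {"type": "string"}},
}

def run_sql(query: str) -> str:
    # Demo in-memory database so the example is self-contained.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'ada')")
    return json.dumps(conn.execute(query).fetchall())

# If the model emits {"tool": "run_sql", "arguments": {"query": "SELECT * FROM users"}},
# the runtime executes it and returns the result as the tool's observation:
print(run_sql("SELECT * FROM users"))  # -> [[1, "ada"]]
```

The unstructured path has no such schema: the model must perceive pixels and drive the interface itself, which is why it demands multimodal perception rather than a simple API contract.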

Tight Supply Balance: Both Power and Chip Capacity Are Strained

Responding to the market's ongoing concerns about AI compute supply, Huang gave his most direct judgment to date: both electricity and chip production capacity are in a tight balance, with no room to double in the short term.

"I don't believe we have twice the electricity demand, nor do we have twice the chip supply; there is no redundancy in any aspect," he said. "But from what I see in the current outlook, our supply chain can support it."

He said NVIDIA has about 200 long-term supply chain partners and has planned proactively both upstream and downstream, leaving him optimistic about large-scale growth over the next two years.

Still, he admitted the biggest bottleneck may not be the chips themselves but the speed at which land, power, and data center construction come together. "What I hope for most is that this infrastructure gets built faster."

When asked whether NVIDIA is the biggest beneficiary of compute scarcity, Huang acknowledged that the company is the largest player and has the most prepared supply chain, but attributed this to long-term planning rather than a coincidental market advantage.