Intel vs. NVIDIA: Who Will Emerge Victorious?
The emergence of ChatGPT has not only pointed to a viable path for putting AI technology to work in industry, but also set off a new round of competition in AI hardware. Yet the AGI and LLM wave that ChatGPT set in motion has still not produced an AI product that can compete head-on with NVIDIA's GPUs.
Much like Emperor Wu of Han's famous line, "Where the enemy can go, so can we," first AMD and now Intel have demonstrated through action an attitude of "if NVIDIA can do it, so can we."
On July 11, Intel launched Habana Gaudi 2, a high-end AI processor aimed at the Chinese market and built to accelerate AI training and inference. What sets it apart is that it is a dedicated AI accelerator paired with Intel Xeon CPUs, not a GPU.
Gaudi 2 provides the market with a new choice beyond GPUs. Can Intel successfully challenge NVIDIA with this?
What are the advantages of this second option?
Gaudi 2 was first released overseas in May 2022; this launch is a customized version for the Chinese market, much as NVIDIA introduced the export-compliant "A800" and "H800" for China.
Gaudi 2 was developed by Habana Labs, an AI chip startup founded in Israel in 2016 that builds programmable deep learning accelerators for data centers. Habana Labs launched the first-generation Gaudi in 2019, and in December of that year Intel, one of its early investors, acquired the company for $2 billion.
Currently, Habana Labs has released two series of AI products. The Gaudi series is designed for AI training, while the Goya series is designed for AI inference.
The customized Gaudi 2 that Intel introduced for the Chinese market is the second-generation AI hardware accelerator designed by Habana Labs. Each server carries 8 accelerator devices (HPUs: Habana Processing Units), and each device comes with 96GB of HBM2E memory offering bandwidth of up to 2.4TB/s.
Sandra Rivera, Executive Vice President and General Manager of Intel's Data Center and AI business, did not go into detail on Gaudi 2's specifications, instead emphasizing its "cost-effectiveness." Eitan Medina, Chief Operating Officer of Habana Labs, added that although the Chinese version of Gaudi 2 has fewer 100G Ethernet ports than the international version, "based on customer usage, the expected impact will be minimal."
Based on the information available so far, Inspur has adopted Gaudi 2: its new-generation AI server NF5698G7, which carries 8 Gaudi 2 deep learning accelerators, has already been deployed. According to Intel, Ziguang Xinhua, Super Fusion, and Baidu Intelligent Cloud will also become Gaudi 2 users. In short, the Gaudi 2 deep learning accelerator builds on the first-generation Gaudi's high-performance architecture, is fabricated on TSMC's 7nm process, and is designed specifically for training large language models.
According to MLCommons MLPerf benchmark tests (a mainstream AI performance benchmark), Gaudi 2's overall performance is higher than NVIDIA's A100 but lower than NVIDIA's H100. Its performance per watt running ResNet-50 is roughly twice that of the A100, and roughly 1.6 times the A100's when running the 176-billion-parameter BLOOMZ model.
MLPerf runs its evaluations twice a year. In this June's round, Gaudi 2 was the only solution besides NVIDIA's H100 to submit performance results for training the GPT-3 large model.
Beyond its fit for large GPT-style models (built on the Transformer architecture), Gaudi 2 has also shown strong inference performance in recent Hugging Face evaluations, including on Stable Diffusion (the text-to-image model Qualcomm has also showcased on its edge devices) and on the BLOOMZ model at 7 billion and 176 billion parameters.
For example, Gaudi 2 delivered 2.21 times lower latency than NVIDIA's A100 when running inference on the Stable Diffusion model.
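Evaluations like the Hugging Face ones cited here typically run through the Optimum Habana library. Below is a minimal sketch following its documented pattern; the model checkpoint and prompt are illustrative, not taken from the evaluation itself.

```python
from optimum.habana.diffusers import (
    GaudiDDIMScheduler,
    GaudiStableDiffusionPipeline,
)

model_name = "runwayml/stable-diffusion-v1-5"  # illustrative checkpoint

# Load a Gaudi-aware scheduler and pipeline targeting the HPU.
scheduler = GaudiDDIMScheduler.from_pretrained(model_name, subfolder="scheduler")
pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    model_name,
    scheduler=scheduler,
    use_habana=True,       # execute on the Gaudi HPU rather than CPU/GPU
    use_hpu_graphs=True,   # capture HPU graphs to reduce host-side overhead
    gaudi_config="Habana/stable-diffusion",
)

# Generate images from a text prompt, as in the latency comparison above.
images = pipeline(
    prompt="an astronaut riding a horse on Mars",
    num_images_per_prompt=4,
).images
```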
It is fair to say that while Gaudi 2 cannot replace NVIDIA's H100, Intel has offered a new option for LLM inference and training: a "CPU (Xeon) + accelerator (Gaudi 2)" combination instead of a GPU.
After all, AGI and LLM training and inference were never confined to GPUs; they can also run on CPUs paired with AI accelerators.
Rivera argues that users in fact have differentiated product needs. Users with small and medium-sized models, for example, can run inference on Intel's 4th Gen Xeon processors with Intel AMX (Advanced Matrix Extensions); those training new models with billions of parameters, which demand far more compute, can turn to Gaudi.
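In practice, AMX is reached through the software stack rather than programmed by hand. A minimal sketch of bfloat16 CPU inference via Intel Extension for PyTorch follows; the small network is a toy stand-in for a real model.

```python
import torch
import intel_extension_for_pytorch as ipex  # Intel Extension for PyTorch

# Toy network standing in for a real inference workload (illustrative only).
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).eval()
example_input = torch.randn(8, 1024)

# bfloat16 inference lets oneDNN dispatch matrix multiplies to the AMX
# tile units on 4th Gen Xeon; ipex.optimize applies operator-level fusions.
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    logits = model(example_input)
```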
When it comes to large-scale commercial deployment, Gaudi 2 can deliver near-linear performance growth through horizontal cluster scaling.
The newly released MLPerf Training 3.0 results from MLCommons show that when running the 175-billion-parameter GPT-3 model, Gaudi 2 maintains close to 95% linear performance scaling as the accelerator count grows from 256 to 384.
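As a back-of-the-envelope reading of that figure, the arithmetic below is derived purely from the percentages quoted above, not from the raw MLPerf data.

```python
# Illustrative arithmetic: what ~95% linear scaling from 256 to 384
# accelerators implies for relative throughput.
base, scaled = 256, 384
ideal_speedup = scaled / base                # 1.5x under perfect scaling
efficiency = 0.95                            # reported scaling efficiency
actual_speedup = ideal_speedup * efficiency  # ~1.43x observed
print(f"ideal: {ideal_speedup:.2f}x, observed: ~{actual_speedup:.2f}x")
```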
Among the many solutions submitted to MLPerf Training 3.0, Intel's Xeon Scalable processors were the only CPU-based entry. They support "out-of-the-box" deployment, meaning AI can run on general-purpose systems, improving usability and lowering costs.
A Matter of Taste: Between Plump and Slender
Since the emphasis is on cost-effectiveness, Gaudi 2 is clearly not positioned as a top-end flagship but aimed at volume shipments: the equivalent of a "mid-to-high-end" smartphone, focused on capturing as much market share as possible.
This "starting from the mid-range" market strategy has become Intel's main focus in recent years.
In this generative AI battle, Intel is combining its CPU strengths with AI accelerator chips, pairing the 4th Gen Intel Xeon Scalable CPU (with Intel AMX: Advanced Matrix Extensions) with Gaudi 2 to take on NVIDIA in the mid-range market.
Within this combination, the CPU-side AI inference performance that AMX provides should not be overlooked; indeed, AMX's capability in CPU-based AI inference and training is what gives Intel the confidence to lean on its traditional strengths and mount an aggressive competitive strategy.
In AI inference workloads, AMX delivers more than 5 times the inference performance of the NVIDIA A100 GPU and more than 2 times that of AMD's 64-core EPYC CPU; in training, AMX performs nearly 3 times better than the A100, completing training jobs in seconds or minutes while significantly lowering user costs.
Intel has publicly demonstrated Stable Diffusion running on its Xeon Max chip. Stable Diffusion generates images from text or image prompts, and in the demonstration the model running on the AMX-equipped chip produced an image in just 5.34 seconds.
Intel's AI solution built on this "CPU + AI accelerator" combination offers, as Intel puts it, a highly competitive choice for customers looking to break free of the efficiency and scale limits of a closed ecosystem.
In this combination, Intel uses Gaudi 2's performance and energy-efficiency advantages to segment user needs, stressing a cost-effectiveness pitch: outperforming NVIDIA's A100 today (and, Intel hopes, the H100 in the future) while cutting the time, money, and energy spent acquiring and running GPUs. What is cost-effectiveness? At its core, "saving money." That is how Intel intends to erode NVIDIA's share of the mid-to-high-end market.
Ease of use and a smooth transition from existing systems are also part of Intel's AI market strategy.
"Plug and play" embodies ease of use. Gaudi 2's SynapseAI software suite integrates two common deep learning frameworks, PyTorch and TensorFlow, as well as mainstream LLM training frameworks such as Megatron and DeepSpeed. This means that developers can quickly migrate code to different hardware platforms. How fast is the migration speed? 10 minutes, including the time spent reading the documentation.
From the launch of a dedicated Chinese version of Gaudi 2 to the rollout of Inspur's new-generation AI server NF5698G7, which pairs Gaudi 2 accelerators with two AMX-capable Xeon chips, it is evident that Intel attaches great importance to the Chinese market.
Chinese users of Intel's AI products have voiced their approval. Liu Jun, Senior Vice President of Inspur and General Manager of its AI&HPC product line, said that after hands-on use, the company's algorithm engineers found the user experience of the Chinese version of Gaudi 2 "basically no different from a GPU."
"Basically no different," however, does not mean no different at all.
Wall Street News noted that the performance advantage of Intel's Chinese version of Gaudi 2 (and of the international version) over NVIDIA's A100 is concentrated mainly in ResNet-style models built on residual structures, rather than in Transformer-architecture models of the kind behind GPT. The two differ significantly.
Whether the market embraces the Chinese version of Gaudi 2, and whether buyers prefer plump or slender, are questions only time can answer.