In-depth exploration of the world's largest AI supercluster xAI Colossus

Wallstreetcn
2024.11.02 01:48

xAI Colossus is the world's largest AI supercluster, representing an investment of hundreds of millions of dollars. It is equipped with 100,000 NVIDIA H100 GPUs, with plans to expand to 200,000. The cluster was deployed in 122 days and comprises over 1,500 racks organized into arrays of 512 GPUs each. Every GPU server carries nine 400GbE network connections for a total bandwidth of 3.6 Tbps, uses a Supermicro liquid cooling system, and is designed for easy maintenance.

xAI Colossus Data Center Overview

I. GPU Computing System

GPU: Colossus currently has 100,000 NVIDIA Hopper GPUs deployed and plans to expand to 200,000 by adding 50,000 H100 and 50,000 H200 GPUs. All GPUs are integrated on the NVIDIA HGX H100 platform, with eight GPUs per platform.

Rack Configuration: Each rack accommodates 64 GPUs, and eight racks form an array of 512 GPUs. Colossus has over 1,500 racks, or close to 200 arrays (see the arithmetic sketch at the end of this section).

Servers: Supermicro 4U general-purpose GPU liquid cooling system. The internal structure of the servers includes:

8-GPU NVIDIA HGX tray: Uses Supermicro's custom liquid cooling modules; each tray carries eight NVIDIA Hopper GPUs (H100 or H200) and the NVIDIA NVLink switches.

CPU Tray: Equipped with two x86 CPU liquid cooling blocks and a custom liquid cooling block for cooling four Broadcom PCIe switches.

Maintainability: Supermicro systems feature a maintainable tray design, allowing maintenance without removing the entire machine from the rack. Each server is equipped with four hot-swappable power supplies.

Network: Each server has nine 400GbE network connections, for a total bandwidth of 3.6 Tbps. Eight NVIDIA BlueField-3 SuperNICs handle the AI (GPU) network, while one Mellanox ConnectX-7 NIC provides additional networking for the CPU side.
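
As a quick sanity check, the figures quoted in this section are mutually consistent. A minimal sketch, using only numbers stated above:

```python
# Quick arithmetic check of the figures quoted in this section (no external data).
gpus_per_server = 8          # one HGX H100 platform per 4U server
servers_per_rack = 8         # 8 servers x 8 GPUs = 64 GPUs per rack
racks_per_array = 8          # 8 racks x 64 GPUs = 512 GPUs per array

gpus_per_rack = gpus_per_server * servers_per_rack     # 64
gpus_per_array = gpus_per_rack * racks_per_array       # 512

total_gpus = 100_000
arrays_needed = total_gpus / gpus_per_array            # ~195, i.e. "close to 200 arrays"
racks_needed = total_gpus / gpus_per_rack              # ~1,563, i.e. "over 1,500 racks"

# Per-server network bandwidth: nine 400GbE links
bandwidth_tbps = 9 * 400 / 1000                        # 3.6 Tbps

print(arrays_needed, racks_needed, bandwidth_tbps)
```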

II. CPU Computing System

Servers: Supermicro 1U servers, 42 units per rack.

CPU: High-speed x86 CPUs, specific models unknown.

Network: Each server is equipped with one 400GbE network card.

Cooling: CPU servers utilize air cooling design, transferring heat to the liquid cooling loop through a heat exchanger at the rear of the rack.

III. Storage System

Scale: EB-level storage.

Medium: NVMe SSD.

Servers: Supermicro 1U servers.

Features: To meet the enormous storage capacity demands of AI training, Colossus's storage is primarily delivered over the network for access by all GPU and CPU servers.

IV. Network System

GPU Network:

Technology: Utilizes 400GbE Ethernet, employing NVIDIA Spectrum-X networking solutions, supporting RDMA technology.

Switches: NVIDIA Spectrum-X SN5600 Ethernet switches with 64 ports, each supporting speeds of up to 800Gb/s; a switch can be split into 128 400GbE links (see the port arithmetic sketch at the end of this section).

Network Card: NVIDIA BlueField-3 SuperNIC, providing dedicated network connections for each GPU.

Storage Network: Also 400GbE Ethernet, built on 64-port 800GbE Ethernet switches.

Features: The network system of Colossus uses Ethernet instead of technologies like InfiniBand, mainly because Ethernet offers better scalability to meet the massive scale requirements of Colossus. The GPU network and CPU network are separated to ensure optimal performance of the high-performance computing cluster.
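
A minimal sketch of the port arithmetic behind these switches; the only assumption beyond the figures above is that the split is done via breakout of each 800Gb/s port into two 400GbE links:

```python
# Port-splitting arithmetic for the SN5600-class switches described above.
ports = 64
port_speed_gbps = 800

total_switching_tbps = ports * port_speed_gbps / 1000   # 51.2 Tbps of switching capacity
links_at_400gbe = ports * (port_speed_gbps // 400)      # 128 x 400GbE links per switch

print(total_switching_tbps, links_at_400gbe)
```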

V. Cooling System

GPU Servers:

Cooling Method: Liquid cooling.

CDU: Each rack is equipped with a Supermicro CDU and a redundant pump system at the bottom.

Coolant Circulation: Coolant flows from the rack distribution pipes into each server's distribution manifold, passes through the liquid cooling blocks on the 8-GPU NVIDIA HGX tray and the CPU tray, and returns to the CDU.

Others: The rack still retains a fan system to cool low-power components such as memory, power supply units, motherboard management controllers, and network cards.

CPU Servers, Network Devices, and Storage Systems: Air cooling, transferring heat to the liquid cooling circuit through a heat exchanger at the rear of the rack. The heat exchanger is similar to a car radiator, drawing hot air through the fins with a fan and transferring heat to the circulating water.

Data Center: A facility water loop is used: the CDUs transfer heat into the circulating water, and the warm water is cooled outside the facility before being reused. Large supply pipelines bring cool water into the facility and through each rack's CDU, where it absorbs heat; the warm water is then routed to cooling equipment outside the facility.
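
To give a feel for the liquid cooling loop just described, here is a back-of-the-envelope sketch of the coolant flow one rack might need. The heat load and temperature rise below are hypothetical assumptions, not figures from the article:

```python
# Back-of-the-envelope coolant flow estimate for one liquid-cooled GPU rack.
# All inputs are illustrative assumptions, not figures from the article.
rack_power_kw = 100          # hypothetical heat load for a 64-GPU rack
delta_t_c = 10               # hypothetical coolant temperature rise across the rack
cp_water = 4.186             # kJ/(kg*K), coolant approximated as water

# Q = m_dot * cp * dT  ->  m_dot = Q / (cp * dT)
flow_kg_per_s = rack_power_kw / (cp_water * delta_t_c)   # ~2.4 kg/s
flow_l_per_min = flow_kg_per_s * 60                      # ~143 L/min

print(round(flow_kg_per_s, 2), round(flow_l_per_min))
```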

VI. Power System

Power Supply: A three-phase power supply is used, with multiple power strips equipped for each rack.

Energy Storage: Tesla Megapack battery packs serve as an energy buffer between the supercomputer and the grid, with each Megapack able to store up to 3.9 MWh. The Megapacks were introduced to relieve the stress that the GPU servers' power fluctuations place on the grid.
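
For a sense of scale, a minimal sketch of what a 3.9 MWh buffer means in time terms; the size of the load swing is a hypothetical assumption, only the 3.9 MWh capacity comes from the article:

```python
# How long one Megapack's stored energy could cover a sustained load swing.
# The swing magnitude is a hypothetical assumption.
megapack_mwh = 3.9
hypothetical_swing_mw = 10      # assumed sustained shortfall to bridge

hours = megapack_mwh / hypothetical_swing_mw    # 0.39 h
minutes = hours * 60                            # ~23 minutes

print(round(minutes, 1))
```

In practice the Megapacks are described later in the article as smoothing millisecond-scale fluctuations rather than riding through long outages, so the real duty is short bursts, not sustained discharge.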

VII. Others

Monitoring System: Each rack's CDU has an independent monitoring system that can monitor parameters such as flow and temperature. Additionally, LED indicators are installed at the rear of the rack to display device status, with blue indicating normal operation and red indicating a fault.

xAI Colossus Data Center Computing Hall

Through an in-depth visit to the xAI Colossus supercomputer, we got a first-hand sense of the scale of the AI computing power xAI has deployed in Memphis, Tennessee. This AI computing cluster, with a total investment of hundreds of millions of dollars and equipped with 100,000 NVIDIA H100 GPUs, has drawn industry attention not only for its scale but also for its construction speed: the team completed the deployment of the entire cluster in just 122 days. Now, let's take a look inside the facility.

xAI's Liquid Cooling Rack Technology

The core building unit of the Colossus computing cluster is Supermicro's liquid cooling rack system. Each rack integrates eight 4U servers, with each server equipped with eight NVIDIA H100 GPUs, for a total of 64 GPUs per rack. A complete GPU computing rack consists of eight GPU servers, a Supermicro Coolant Distribution Unit (CDU), and supporting equipment.

Low-angle view of xAI Colossus Data Center Supermicro Liquid Cooling Node

These racks are deployed in groups of eight, with each group supporting 512 GPUs and equipped with network interconnection facilities to form computing sub-clusters in larger-scale systems.

xAI Colossus Data Center Supermicro 4U General GPU Liquid Cooling Server

xAI uses Supermicro's 4U general GPU system, which is currently the most advanced AI computing server on the market, with advantages primarily in two areas: leading liquid cooling technology and excellent maintainability.

xAI Colossus Data Center Supermicro 4U General GPU Liquid Cooling Server

These systems were first showcased at the 2023 Supercomputing Conference (SC23). Unfortunately, we were unable to unbox the system on-site in Memphis as it was executing training tasks during our visit. It is worth mentioning that the system features a serviceable tray design, allowing maintenance without removing the entire unit from the rack. The 1U rack distribution pipeline is responsible for delivering coolant to each system and recovering heated liquid. Quick-disconnect fittings make the disassembly of the liquid cooling system convenient; we demonstrated the ease of operating these fittings with one hand last year. Once the fittings are disconnected, the tray can be easily pulled out for maintenance.

Supermicro 4U Universal GPU System for Liquid Cooling NVIDIA HGX H100 and HGX H200 (Displayed at SC23)

The images of these server prototypes showcase the internal structure of the system. In addition to using Supermicro's custom liquid cooling module for the 8-GPU NVIDIA HGX tray, the design of the CPU tray fully demonstrates the industry-leading next-generation engineering concept.

Supermicro 4U Universal GPU System for Liquid Cooling NVIDIA HGX H100 and HGX H200 (Displayed at SC23)

The two x86 CPU liquid cooling blocks in the SC23 prototype are fairly standard. The unique part is on the right side: Supermicro's motherboard integrates the four Broadcom PCIe switches used in almost all of today's HGX AI servers, rather than placing them on a separate board, and a custom liquid cooling block cools those four PCIe switches. Most other AI servers in the industry are air-cooled designs retrofitted with liquid cooling; Supermicro's system was designed for liquid cooling from the start, with all components coming from a single supplier.

Supermicro SYS-821GE-TNHR, NVIDIA H100 and NVSwitch Liquid Cooling Module

This can be likened to the automotive field: some electric vehicles are modified from traditional fuel vehicle chassis, while others are designed as pure electric vehicles from the ground up. Supermicro's system belongs to the latter category, while other HGX H100 systems are similar to the former. We have actually tested most publicly available HGX H100/H200 platforms and some hyperscale designs, and the advantages of Supermicro's system compared to other systems (including Supermicro's own other liquid-cooled or air-cooled designs) are significantly evident.

The rear of the rack carries 400GbE fiber for interconnecting the GPU and CPU fabrics, as well as copper cabling for the management network. The network interface cards (NICs) sit in their own tray at the rear of the chassis so they can be swapped quickly without removing the whole unit. Each server has four hot-swappable power supplies fed by three-phase power distribution units (PDUs).

xAI Colossus Data Center Supermicro 4U Universal GPU Liquid Cooling Server Rear View

At the bottom of the rack sits the coolant distribution unit (CDU), essentially a large heat exchanger. Each rack has an independent fluid circulation loop that cools all of its GPU servers. "Fluid" is used rather than "water" because the loop requires specific coolants matched to the liquid cooling blocks, pipes, manifolds, and other hardware.

xAI Colossus data center Supermicro CDU is located at the bottom of the rack.

Each CDU is equipped with redundant pumps and power supplies, allowing for replacement without downtime in the event of a single pump failure.

Disassembling the Supermicro CDU pump.

The xAI racks are feature-rich, and in related videos from 2023, we detailed the structure of the Supermicro CDU, including the water circuit in the machine room and the inlet and outlet interfaces of the rack distribution pipes, as well as the hot-swappable redundant power supplies for each CDU.

Rear view of the Supermicro CDU 2023.

The CDU in the Colossus rack is obscured by various pipes and cables.

Rear view of the xAI Colossus data center Supermicro CDU.

Both sides of the rack are equipped with three-phase PDUs and rack distribution pipes. The front-mounted 1U distribution pipe supplies liquid to the 4U universal GPU system, which is fed by the rack distribution pipe connected to the CDU. All components are marked with red and blue color coding, with red indicating the hot fluid circuit and blue indicating the cold fluid supply.

xAI Colossus data center Supermicro rack distributor hose.

The rack still retains a fan system to cool low-power components such as memory (DIMMs), power supply units, baseboard management controllers (BMCs), and network cards. In Colossus, each rack must stay in cooling balance so that high-powered air handling equipment is not needed. Server fans draw cold air in at the front and exhaust it at the back, where it passes through the rear door heat exchanger.

xAI Data Center Back Door Heat Exchanger

The principle of the back door heat exchanger is similar to a car radiator: fans at the rear of the unit draw the hot air expelled from the rack through a finned heat exchanger, and the fluid inside the exchanger carries the heat into the facility water loop. The units have LED indicators that glow blue during normal operation and change to other colors (such as red) when maintenance is needed.
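
To complement the liquid-side estimate earlier, here is a back-of-the-envelope sketch of the air-side duty of such a heat exchanger. All numbers are illustrative assumptions, not figures from the article:

```python
# Back-of-the-envelope airflow estimate for a rear door heat exchanger.
# All inputs are illustrative assumptions, not figures from the article.
residual_air_heat_kw = 20     # hypothetical heat carried by server exhaust air per rack
delta_t_c = 15                # hypothetical air temperature rise across the rack
cp_air = 1.005                # kJ/(kg*K)
air_density = 1.2             # kg/m^3

# Q = m_dot * cp * dT  ->  m_dot = Q / (cp * dT)
mass_flow = residual_air_heat_kw / (cp_air * delta_t_c)   # ~1.33 kg/s
volume_flow_m3_s = mass_flow / air_density                # ~1.1 m^3/s
volume_flow_cfm = volume_flow_m3_s * 2118.88              # ~2,300 CFM

print(round(mass_flow, 2), round(volume_flow_cfm))
```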

During the site visit, several racks had not yet been powered on, and it was interesting to watch the heat exchangers' indicators change color as racks came online.

xAI Data Center Back Door Heat Exchanger

These back door heat exchangers play a dual role in the data center: they not only handle the waste heat from Supermicro liquid-cooled GPU servers but also manage the heat generated by storage systems, CPU computing clusters, and network devices.

xAI's Storage System

In a typical AI computing cluster, large storage arrays are standard. In this project, although storage software from different vendors is running, the vast majority of storage server hardware is provided by Supermicro. This is understandable, as Supermicro is an OEM for several storage device suppliers.

xAI Colossus Data Center Supermicro 1U NVMe Storage Node

One striking detail during the site inspection was that some storage servers looked very similar to CPU computing servers.

xAI Colossus Data Center Supermicro 1U NVMe Storage Node

From our pictures and video, we can see a large number of 2.5-inch NVMe drive trays. Large-scale AI computing clusters are transitioning from mechanical hard drives to flash (SSD) storage. Flash not only cuts energy consumption significantly but also delivers higher performance and storage density. Although the upfront cost per petabyte of flash is higher, from a TCO perspective flash is often the more cost-effective choice at this scale.

xAI's CPU Computing System

In large-scale computing clusters, traditional CPU computing nodes still play an important role. Compared to GPUs, CPUs retain unique advantages in data processing and operational tasks. In addition, focusing GPU resources on AI training and inference while letting CPUs handle other computational work is a more efficient way to allocate resources.

xAI Colossus Data Center CPU Computer Rack

On-site, we observed rows of 1U servers. Each server's design strikes a balance between compute density and cooling requirements: roughly one-third of the front panel is used for cold air intake, while the rest holds NVMe drive trays marked with orange labels.

xAI Colossus Data Center CPU Computer Rack

These 1U computing servers use air cooling design, transferring heat to the facility water cooling system through a Rear Door Heat Exchanger. This design allows xAI to accommodate both liquid-cooled and air-cooled equipment's cooling needs within the same data center infrastructure.

xAI's Network

The network is one of the most attention-grabbing aspects of this project. Although the underlying technology is still Ethernet, the same network protocol used by ordinary computers, it employs a 400GbE network, which has a transmission rate 400 times that of the common 1GbE network. Each system is equipped with nine such connections, resulting in a total bandwidth of an astonishing 3.6Tbps for a single GPU computing server.

xAI Colossus Data Center Network Interface Card (NIC)

Data transmission for GPUs primarily relies on RDMA networks. Each GPU is equipped with a dedicated network card, and the project uses NVIDIA BlueField-3 SuperNIC and Spectrum-X network solutions. NVIDIA's network technology stack has unique advantages, ensuring efficient and precise data transmission within the cluster.

xAI Colossus Data Center Switch Fiber

It is worth noting that, unlike most supercomputers, which use technologies like InfiniBand, this project chose Ethernet. The choice is strategically significant: Ethernet, as the foundational protocol of the internet, offers exceptional scalability. Today's large-scale AI clusters have grown beyond the scale that many specialized proprietary interconnects were designed for, and the xAI team has made a forward-looking bet here.

In addition to the RDMA network for GPUs, the CPU system is also equipped with a separate 400GbE network, utilizing a completely different switching architecture. This design, which separates the GPU network from the regular cluster network, is a best practice in high-performance computing (HPC) clusters.
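
As an illustration of what this separation can look like from the software side, here is a minimal sketch of how a training job might be steered onto a dedicated RDMA fabric while control traffic stays on the regular network. The device and interface names are hypothetical placeholders; the article does not describe xAI's actual configuration:

```python
# Illustrative only: steering NCCL collective traffic onto a dedicated RDMA (RoCE)
# fabric while keeping bootstrap/control traffic on a separate Ethernet interface.
# Device/interface names are hypothetical placeholders.
import os

# Use the RDMA-capable NICs (e.g. BlueField-3 devices exposed via the mlx5 driver)
# for GPU collective traffic.
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1,mlx5_2,mlx5_3"

# Keep NCCL's socket/bootstrap traffic on the regular CPU-side network interface.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

# These variables must be set before NCCL is initialized, e.g. before calling
# torch.distributed.init_process_group(backend="nccl").
```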

xAI Colossus data center single-mode and multi-mode fiber

To put 400GbE in perspective, a single link offers more bandwidth than all of the PCIe lanes of a top-end Intel Xeon server processor from early 2021 combined, and each GPU server carries nine such links.
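
A rough check of that comparison; the CPU generation is our assumption (an early-2021 flagship Intel Xeon Scalable part with 48 lanes of PCIe Gen3), not something stated in the article:

```python
# Rough arithmetic behind the comparison above. The CPU generation is an assumption:
# early-2021 top-end Intel Xeon Scalable (Cascade Lake) parts exposed 48 lanes of PCIe Gen3.
pcie_gen3_gbps_per_lane = 8 * 128 / 130       # ~7.9 Gb/s usable per lane, per direction
xeon_lanes = 48
xeon_total_gbps = xeon_lanes * pcie_gen3_gbps_per_lane   # ~378 Gb/s

link_gbps = 400                                # one 400GbE link
links_per_server = 9

print(round(xeon_total_gbps), link_gbps, links_per_server * link_gbps)  # 378 < 400; 3600 total
```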

xAI Colossus data center switch stack

Such dense network interconnections require extensive fiber cabling. Each fiber is precisely cut, terminated, and managed for identification.

xAI Colossus data center fiber cabling

I met some personnel working on this in August. Their structured cabling is always done very neatly.

xAI Colossus data center fiber cabling

In addition to the high-speed cluster network, the facility has also deployed a low-speed network for managing interfaces and environmental equipment, which is an essential component of large-scale clusters.

During the site visit, the need for liquid-cooled network switches was evident. The 64-port 800GbE switch we recently evaluated is in the same 51.2T class as the switches used by most AI clusters. The challenge facing the industry is how to cool both the switching silicon and the optics, the latter of which often consume more power in modern switches. Deployments of this scale may drive the development of co-packaged optics, allowing switch cooling to integrate cleanly with liquid-cooled computing systems. We have previously seen prototypes of liquid-cooled co-packaged optics switches and look forward to deployments like this one pushing the technology from experimentation to mass production.

xAI Colossus's Data Center Facilities

Because the AI server racks are liquid cooled, power and facility water are crucial to the installation. Here, the massive water supply pipelines are shown, divided into cold (supply) and hot (return) groups. Cool water is brought into the facility and circulates through the coolant distribution units (CDUs) in each rack; heat from the GPUs and the rear door heat exchanger loop is transferred into the facility water loop at the CDU. The warm water is then routed to cooling equipment outside the facility. Notably, this equipment does not chill the water to low temperatures; it only needs to bring the water back down to a temperature suitable for reuse.

xAI Colossus Data Center Facility Water Pipeline

The power system is equally impressive. During our visit to Memphis, we witnessed the team moving massive cables into place.

Electrical Infrastructure of xAI Colossus Data Center

Outside the data center facilities, we saw containers of Tesla Megapacks. This was a key lesson for the team while building this massive cluster: the power consumption of AI servers is not constant but swings with the workload, and with so many GPUs deployed on-site, the peaks and troughs are pronounced. The team found that millisecond-scale power transients put significant stress on the system, so Tesla Megapacks were introduced to buffer the peaks and improve stability.

Tesla Megapacks ready for installation at xAI Colossus

Of course, this is just the beginning of the facility's build-out. The initial cluster, four data halls of roughly 25,000 GPUs each for approximately 100,000 GPUs in total, was operational during our visit, and expansion work is progressing rapidly.

Exterior of the xAI Colossus Data Center in Memphis

This is undoubtedly an exciting start.

Summary

Throughout this process, I came to appreciate the tremendous effort the xAI team has put into coordinating so many suppliers. Building a cluster of this scale depends on experts from many fields working together, and they have pulled it off at an incredible speed; from what I saw on the day we filmed, it is hard to overstate the dedication behind it.

The AI community generally believes that as computing power continues to grow, the potential of large language models (LLMs) will go far beyond chatbots. Walking through Colossus, I came away convinced that only when people see the immense value driven by data will they invest resources on this scale. The future of Grok and the xAI team will undoubtedly go beyond simple chatbots, and many talented people are pouring enormous effort and money into realizing that vision as soon as possible.