Meta Builds Large-Scale AI Network Based on RoCEv2 Protocol for Distributed AI Training

TapTechNews, August 7 news: Meta published a blog post on August 5 describing the large-scale AI network it has built on the RoCEv2 protocol to meet the networking requirements of large-scale distributed AI training.

RoCEv2 stands for RDMA over Converged Ethernet version 2. It is an inter-node communication transport that carries RDMA traffic over ordinary UDP/IP networks, and it serves as the transport for the majority of Meta's AI capacity.
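Concretely, RoCEv2 tunnels InfiniBand transport semantics over standard UDP/IP (UDP destination port 4791), which is what makes the traffic routable across an ordinary Ethernet/IP fabric. The sketch below is purely illustrative of that layering; the field subset and example values are ours, not Meta's.

```python
# Illustrative sketch of the RoCEv2 encapsulation (not Meta's code):
# Ethernet / IP / UDP(dport=4791) / InfiniBand BTH / payload / ICRC.
from dataclasses import dataclass

ROCEV2_UDP_DST_PORT = 4791  # IANA-assigned UDP port for RoCEv2

@dataclass
class InfiniBandBTH:
    """InfiniBand Base Transport Header (illustrative subset of fields)."""
    opcode: int   # e.g., 0x0A = RC RDMA WRITE Only
    dest_qp: int  # destination queue pair number
    psn: int      # packet sequence number

@dataclass
class RoCEv2Packet:
    """An RDMA message as it appears on a routable UDP/IP network."""
    src_ip: str
    dst_ip: str
    udp_dst_port: int
    bth: InfiniBandBTH
    payload: bytes

pkt = RoCEv2Packet("10.0.0.1", "10.0.1.7", ROCEV2_UDP_DST_PORT,
                   InfiniBandBTH(opcode=0x0A, dest_qp=42, psn=0),
                   payload=b"tensor shard")
```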

Meta has successfully scaled its RoCE networks from prototypes to the deployment of numerous clusters, each accommodating thousands of GPUs.

These RoCE clusters support a wide range of production-grade distributed GPU training workloads, including ranking, content recommendation, content understanding, natural language processing, and GenAI model training.
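To make the connection between these workloads and the network concrete, here is a minimal sketch of how a distributed training job drives the back-end fabric. It assumes PyTorch with the NCCL backend, a common setup for such jobs; nothing here is taken from Meta's own training code.

```python
# Hedged sketch: a gradient all-reduce, the collective that dominates
# distributed training traffic. NCCL moves it over RDMA (here, RoCEv2)
# when RDMA-capable NICs are available.
import os
import torch
import torch.distributed as dist

def train_step(rank: int, world_size: int) -> None:
    # Rendezvous settings for the default "env://" init method; in a real
    # cluster these point at the job's rank-0 host.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank % torch.cuda.device_count())

    grad = torch.randn(1024, device="cuda")      # stand-in gradient shard
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # GPU-to-GPU over the fabric
    grad /= world_size                           # average across ranks
    dist.destroy_process_group()
```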

Meta built a dedicated back-end network for distributed AI training, one that can be developed, operated, and scaled independently of the rest of the data center network.

The training clusters rely on two separate networks: a front-end (FE) network for tasks such as data ingestion, checkpointing, and logging, and a back-end (BE) network for training traffic, as shown in the following figure:

[Figure: Front-end (FE) and back-end (BE) networks of a training cluster]
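As an illustration of the FE/BE split from a job's point of view, the following sketch shows one conventional way to keep NCCL's collective traffic on the back-end RDMA NICs while everything else uses the front end. NCCL_IB_HCA and NCCL_SOCKET_IFNAME are real NCCL environment variables; the device and interface names are hypothetical, and this is not Meta's actual configuration.

```python
# Illustrative sketch: pinning collective traffic to BE RDMA NICs.
import os

os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"  # back-end RDMA NICs (hypothetical names)
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"    # front-end NIC, used only for NCCL bootstrap
# Data loading, checkpointing, and logging use ordinary sockets on the FE
# network, so they never compete with RDMA traffic on the BE fabric.
```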

The FE comprises a hierarchy of network layers, including rack switches (RSW) and fabric switches (FSW), and hosts the storage warehouse that supplies GPUs with the input data needed for training workloads.

[Figure: FE network hierarchy of a training rack]

The back end is a specialized fabric that connects all RDMA NICs in a non-blocking architecture, regardless of their physical location, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster.

[Figure: Back-end network topology]
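A simple way to reason about the "any two GPUs" claim is a point-to-point bandwidth probe between ranks. The sketch below is illustrative, not Meta's tooling; it assumes a PyTorch process group with the NCCL backend has already been initialized on every rank.

```python
# Hedged sketch: measure point-to-point bandwidth between two ranks to
# sanity-check that any GPU pair sees similar bandwidth regardless of
# physical placement on a non-blocking fabric.
import time
import torch
import torch.distributed as dist

def p2p_bandwidth(rank: int, peer: int, nbytes: int = 1 << 30) -> None:
    buf = torch.empty(nbytes, dtype=torch.uint8, device="cuda")
    dist.barrier()                    # align both ranks before timing
    start = time.perf_counter()
    if rank < peer:
        dist.send(buf, dst=peer)      # NCCL send/recv ride the RoCE back end
    else:
        dist.recv(buf, src=peer)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"rank {rank}: {nbytes / elapsed / 1e9:.1f} GB/s")
```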

To meet the GPU-scale demands of LLM training, Meta designed an aggregator training switch (ATSW) layer that interconnects multiple AI zones. Meta has also optimized routing and congestion control to improve network performance.
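On the routing side, one standard technique in such fabrics is ECMP, where switches hash a flow's 5-tuple to choose among equal-cost uplinks; because RoCEv2 runs over UDP, varying the UDP source port per queue pair adds entropy to that hash and spreads large RDMA flows across paths. The sketch below illustrates the general idea only; it is not Meta's routing implementation, and Meta's post describes its own routing and congestion-control choices in more detail.

```python
# Conceptual sketch of ECMP-style flow placement (standard technique,
# not Meta's code): hash the flow's 5-tuple to pick one of N uplinks.
import zlib

def ecmp_pick_uplink(src_ip: str, dst_ip: str, src_port: int,
                     dst_port: int, n_uplinks: int) -> int:
    five_tuple = f"{src_ip}|{dst_ip}|udp|{src_port}|{dst_port}".encode()
    return zlib.crc32(five_tuple) % n_uplinks

# Two queue pairs between the same hosts can land on different uplinks
# purely because their UDP source ports differ:
print(ecmp_pick_uplink("10.0.0.1", "10.0.1.7", 49152, 4791, 4))
print(ecmp_pick_uplink("10.0.0.1", "10.0.1.7", 49153, 4791, 4))
```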

TapTechNews attaches the reference address
