AMD, Cisco, and Others Team on Ethernet for AI and HPC
From tail latency to training algorithms, the Ultra Ethernet Consortium, hosted by the Linux Foundation with founding members across the networking industry, is tackling new challenges and building a complete Ethernet-based stack architecture optimized for artificial intelligence and high-performance computing.
August 25, 2023
As enterprises embrace artificial intelligence (AI), they often find their existing infrastructures are not optimal for AI’s workload demands. A new industry group, the Ultra Ethernet Consortium (UEC), aims to address the issue by building a complete Ethernet-based communication stack architecture for AI and high-performance computing (HPC) workloads.
Given the reliance on standards-based Ethernet in virtually every organization, the group made it clear from the start that the effort will not disrupt existing deployments. “This isn’t about overhauling Ethernet,” said Dr. J Metz, Chair of the Ultra Ethernet Consortium, in a prepared statement released when the group was announced. “It’s about tuning Ethernet to improve efficiency for workloads with specific performance requirements.”
UEC is a Joint Development Foundation project hosted by The Linux Foundation. Founding members of the group include AMD, Arista, Broadcom, Cisco, Eviden (an Atos Business), HPE, Intel, Meta, and Microsoft.
Addison Snell, CEO of Intersect360 Research, put the need for the group’s work into perspective in a released statement. “Today, there are no standard, vendor-neutral data center networking solutions that focus on performance at scale for parallel applications.” He noted that because the majority of data centers are Ethernet-based, there is a need for something that makes scalability more straightforward and within the reach of companies today.
To that end, the UEC plans to focus on networking aspects that could accelerate AI and other workloads. AI is of particular interest because of the way its workloads use a network. Specifically, the group points to the need for large clusters of GPUs to train Large Language Models (LLMs) such as GPT-3, Chinchilla, and PaLM, as well as widely used recommendation systems. The group also notes that many HPC workloads run in distributed environments.
In such environments, a critical limiter of performance is a metric called tail latency. It describes how long the slowest communication exchanges in a distributed task take to complete. It is commonly expressed as a high percentile of response times, such as the 98th percentile: the response time that 98% of communication requests meet or beat. Essentially, it is a measure of how the slowest communication exchanges within a distributed system affect the system's overall performance.
It is a weakest-link measure. No matter how powerful the GPUs and other compute elements in a distributed system are, the slowest communication exchanges determine how fast a job can run.
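To make the percentile definition concrete, here is a minimal Python sketch that computes a 98th-percentile latency using the nearest-rank method. The `percentile` helper and the latency values are hypothetical and chosen only for illustration; they are not drawn from the UEC's material.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that at least
    pct percent of all samples are less than or equal to."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in microseconds: 97 fast exchanges and 3 slow ones.
latencies_us = [10] * 97 + [50, 200, 900]

median_us = percentile(latencies_us, 50)   # what a typical exchange sees
p98_us = percentile(latencies_us, 98)      # tail latency at the 98th percentile
slowest_us = max(latencies_us)             # the exchange a synchronized job waits for

print(median_us, p98_us, slowest_us)
```

With these made-up numbers the median looks healthy, while the tail is several times slower. A job that has to wait for every exchange is paced by the slowest one, not by the median.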
Optimizing Ethernet
The group is looking for ways to reduce tail latency. It has developed a specification that aims to improve several networking elements within an Ethernet system. For example, the specification builds on past work to improve the flow of packets through a distributed system.
The group noted that Ethernet originally relied on spanning tree algorithms to get a packet from point A to point B. It then adopted multi-pathing, which, as the name implies, uses as many network links as possible to exchange packets between two points in a distributed network. Multi-pathing offered performance advantages over spanning tree but could still run into problems, such as when the technology mapped too many flows to a single network path.
The specification recommends a technique known as packet spraying, in which every communication exchange between two points simultaneously uses all available paths between them. The group claims this “achieves a more balanced use of all network paths.”
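The difference between pinning each flow to one path and spraying its packets across all paths can be sketched in a few lines of Python. This is a toy model, not the UEC's algorithm: the path count, flow sizes, the deliberately weak `toy_hash` function, and the round-robin spraying rule are all assumptions made for illustration.

```python
NUM_PATHS = 4
PACKETS_PER_FLOW = 100
FLOWS = [0, 1, 2, 3]  # hypothetical flow identifiers

def toy_hash(flow_id):
    # Deliberately weak stand-in for a header hash, chosen so that
    # several flows collide on the same path.
    return (flow_id * 2) % NUM_PATHS

def per_flow_load():
    """Per-flow multi-pathing: every packet of a flow follows one hashed path."""
    load = [0] * NUM_PATHS
    for flow in FLOWS:
        load[toy_hash(flow)] += PACKETS_PER_FLOW
    return load

def sprayed_load():
    """Packet spraying: each flow spreads its packets round-robin over all paths."""
    load = [0] * NUM_PATHS
    for flow in FLOWS:
        for seq in range(PACKETS_PER_FLOW):
            load[(flow + seq) % NUM_PATHS] += 1
    return load

print("per-flow:", per_flow_load())
print("sprayed: ", sprayed_load())
```

In this toy setup the per-flow mapping leaves two paths idle while the other two carry double the traffic, whereas spraying loads every path equally. Packets of one flow that travel over different paths can arrive out of order, so the receiver has to tolerate or repair reordering.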
Other techniques in the specification include flexible packet delivery order, modern congestion control, and end-to-end telemetry that is used to optimize the congestion control algorithms.
Large-scale systems need even more help
Another aspect of AI and HPC distributed computing performance has to do with the large amount of data used to train algorithms. That data typically crosses the network as a small number of large flows. In describing the specification, the group noted that AI workloads often cannot proceed until all flows are successfully delivered. So, just one overburdened link can slow the entire computation. To address this issue, UEC is looking at better load balancing as a way to raise AI performance.
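A back-of-the-envelope Python sketch shows why a single overburdened link matters when a computation step waits for all flows. It assumes equal-sized flows, links of equal speed, and fair bandwidth sharing among flows on the same link. The function names and the numbers are hypothetical.

```python
def flow_time_s(size_gbit, link_gbps, flows_on_link):
    """Transfer time of one flow that shares a link equally with its neighbours."""
    return size_gbit / (link_gbps / flows_on_link)

def step_time_s(flows_per_link, size_gbit=8.0, link_gbps=100.0):
    """A step finishes only when the slowest flow has been delivered."""
    return max(
        flow_time_s(size_gbit, link_gbps, n)
        for n in flows_per_link
        if n > 0  # idle links carry no flow and do not affect the step
    )

# Four flows over four links: one flow per link, or two flows sharing one link.
balanced = step_time_s([1, 1, 1, 1])
skewed = step_time_s([2, 0, 1, 1])

print(f"balanced: {balanced:.2f} s, skewed: {skewed:.2f} s")
```

Under these assumptions, moving a single flow onto an already busy link doubles the time for the whole step, even though three of the four flows are no slower than before.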
Such efforts build on work already being done in many enterprises and within the networking equipment industry.
Many businesses are already trying to accelerate AI workloads by boosting storage performance with parallel distributed file systems or by speeding up data movement between storage and compute systems with InfiniBand networks. Additionally, Cisco and Broadcom, two members of UEC, have developed networking chips specifically for AI workloads.