
Another multivendor development group, the Ultra Accelerator Link (UALink) consortium, recently published its first specification aimed at delivering an open standard interconnect for AI clusters. The UALink 200G 1.0 Specification was crafted by many of the group’s 75 members (which include AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft and Synopsys) and lays out the technology needed to support a maximum data rate of 200 gigatransfers per second (GT/s) per lane between accelerators and switches, connecting up to 1,024 accelerators within an AI computing pod, UALink stated.
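To put that headline figure in context, here is a back-of-the-envelope bandwidth calculation in Python. The one-bit-per-transfer encoding and the four-lane (x4) port width are assumptions drawn from public summaries of the spec, not from this article.

```python
# Back-of-the-envelope UALink bandwidth math (illustrative only).
# ASSUMPTIONS not stated in this article: one bit moves per transfer
# per lane, and ports can be x4 wide, per public spec summaries.

GT_PER_LANE = 200        # giga-transfers per second, per lane (from the spec)
BITS_PER_TRANSFER = 1    # assumption: 1 bit per transfer per lane
LANES_PER_PORT = 4       # assumption: a x4 port

lane_gbps = GT_PER_LANE * BITS_PER_TRANSFER
port_gbps = lane_gbps * LANES_PER_PORT

print(f"Per lane:    {lane_gbps} Gb/s")   # 200 Gb/s
print(f"Per x4 port: {port_gbps} Gb/s")   # 800 Gb/s
```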
ESUN will leverage the work of IEEE and UEC for Ethernet where possible, Arista CEO Jayshree Ullal and Chief Development Officer Hugh Holbrook stated in a blog post about ESUN. To that end, Ullal and Holbrook described a modular framework for Ethernet scale-up with three key building blocks:
- Common Ethernet headers for interoperability: ESUN will build on top of Ethernet to enable the widest range of upper-layer protocols and use cases.
- Open Ethernet data link layer: Provides the foundation for high-performance AI collectives at XPU cluster scale. By selecting standards-based mechanisms such as Link-Layer Retry (LLR), Priority-based Flow Control (PFC) and Credit-based Flow Control (CBFC), ESUN balances cost-efficiency and flexibility with performance for these networks, where even minor delays can stall thousands of concurrent operations (see the credit-flow sketch after this list).
- Ethernet PHY layer: By relying on the ubiquitous Ethernet physical layer, interoperability across multiple vendors and a wide range of optical and copper interconnect options is assured.
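As a rough illustration of one of those link-layer mechanisms, the following is a minimal Python sketch of credit-based flow control: the sender transmits only while it holds credits representing free receiver buffers, and stalls (rather than dropping frames) when credits run out. The class name, buffer sizes, and API are hypothetical.

```python
# Minimal sketch of credit-based flow control (CBFC), one of the
# standards-based mechanisms ESUN points to. All names and sizes here
# are hypothetical, for illustration only.
from collections import deque

class CreditedLink:
    def __init__(self, rx_buffer_slots: int = 8):
        self.credits = rx_buffer_slots   # sender-side count of free RX buffers
        self.rx_queue = deque()          # models the receiver's buffer

    def try_send(self, frame) -> bool:
        """Sender transmits only while it holds a credit; otherwise it
        stalls rather than risking a drop at the receiver."""
        if self.credits == 0:
            return False                 # back-pressure: wait for a credit
        self.credits -= 1
        self.rx_queue.append(frame)
        return True

    def receiver_drain(self):
        """Receiver frees one buffer slot and returns a credit."""
        if self.rx_queue:
            frame = self.rx_queue.popleft()
            self.credits += 1            # credit flows back to the sender
            return frame
        return None

link = CreditedLink(rx_buffer_slots=2)
assert link.try_send("frame-0") and link.try_send("frame-1")
assert not link.try_send("frame-2")      # out of credits: sender stalls
link.receiver_drain()                    # slot freed, credit returned
assert link.try_send("frame-2")          # sender may proceed again
```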
“ESUN is designed to support any upper layer transport, including one based on SUE-T. SUE-T (Scale-Up Ethernet Transport) is a new OCP workstream, seeded by Broadcom’s contribution of SUE (Scale-Up Ethernet) to OCP. SUE-T looks to define functionality that can be easily integrated into an ESUN-based XPU for reliability, scheduling, load balancing, and transaction packing, which are critical performance enhancers for some AI workloads,” Ullal and Holbrook wrote.
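To make “transaction packing” concrete, here is a hypothetical Python sketch that greedily coalesces small accelerator transactions into frame-sized batches to amortize per-frame overhead. The 4 KiB payload budget and the greedy policy are assumptions for illustration; the SUE-T workstream has not published this design.

```python
# Hypothetical sketch of transaction packing: coalescing many small
# transactions into one Ethernet-sized frame. The payload budget and
# greedy policy are assumptions, not taken from SUE-T.

FRAME_PAYLOAD_BUDGET = 4096  # bytes available in one frame (assumption)

def pack_transactions(txns: list[bytes]) -> list[list[bytes]]:
    """Greedily pack small transactions into frames without splitting any."""
    frames, current, used = [], [], 0
    for t in txns:
        if used + len(t) > FRAME_PAYLOAD_BUDGET and current:
            frames.append(current)       # close the full frame
            current, used = [], 0
        current.append(t)
        used += len(t)
    if current:
        frames.append(current)
    return frames

# 100 small 256-byte writes travel in 7 frames instead of 100.
writes = [bytes(256) for _ in range(100)]
print(len(pack_transactions(writes)))    # 7
```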
“In essence, the ESUN framework enables a collection of individual accelerators to become a single, powerful AI supercomputer, where network performance directly correlates to the speed and efficiency of AI model development and execution,” Ullal and Holbrook wrote. “The layered approach of ESUN and SUE-T over Ethernet promotes innovation without fragmentation. XPU accelerator developers retain flexibility on host-side choices such as access models (push vs. pull, and memory vs. streaming semantics), transport reliability (hop-by-hop vs. end-to-end), ordering rules, and congestion control strategies while retaining system design choices. The ESUN initiative takes a practical approach for iterative improvements.”
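For readers unfamiliar with the access-model distinction Ullal and Holbrook mention, this small Python sketch contrasts push semantics (the initiator writes data into the target) with pull semantics (the initiator requests data and the target returns it). All names are hypothetical and the dict stands in for remote accelerator memory.

```python
# Illustrative contrast of push vs. pull access models. The dict stands
# in for a remote XPU's memory; only the initiator of the data movement
# differs between the two calls.

remote_memory = {"tensor_a": b"\x01\x02\x03"}

def push_write(dst: dict, key: str, payload: bytes) -> None:
    """Push model: the initiator sends data into the target's memory."""
    dst[key] = payload

def pull_read(src: dict, key: str) -> bytes:
    """Pull model: the initiator asks for data; the target returns it."""
    return src[key]

push_write(remote_memory, "tensor_b", b"\x04\x05")   # initiator pushes
data = pull_read(remote_memory, "tensor_a")          # initiator pulls
```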
Gartner expects gains in AI networking fabrics
Scale-up AI fabrics (SAIF) have captured a lot of industry attention lately, according to Gartner. The research firm is forecasting massive growth in SAIF to support AI infrastructure initiatives through 2029. The vendor landscape will remain dynamic over the next two years, with multiple technology ecosystems emerging, Gartner wrote in its report, What are “Scale-Up” AI Fabrics and Why Should I Care?
“Scale-up AI fabrics (SAIF) provide high-bandwidth, low-latency physical network interconnectivity and enhanced memory interaction between nearby AI processors,” Gartner wrote. “Current implementations of SAIF are vendor-proprietary platforms, and there are proximity limitations (typically, SAIF is confined to only a rack or row). In most scenarios, Gartner recommends using Ethernet when connecting multiple SAIF systems together. We believe the scale, performance and supportability of Ethernet is optimal.”