
ESUN initiative
As part of its standardization efforts, Meta said it would be a key player in the new Ethernet for Scale-Up Networking (ESUN) initiative that brings together AMD, Arista, ARM, Broadcom, Cisco, HPE Networking, Marvell, Microsoft, NVIDIA, OpenAI and Oracle to advance the networking technology to handle the growing scale-up domain for AI systems.
ESUN will focus solely on open, standards-based Ethernet switching and framing for scale-up networking, excluding host-side stacks, non-Ethernet protocols, application-layer solutions, and proprietary technologies. The group will concentrate on the development and interoperability of XPU network interfaces and Ethernet switch ASICs for scale-up networks, the OCP wrote in a blog post.
ESUN will actively engage with other organizations, such as the Ultra Ethernet Consortium (UEC) and the long-standing IEEE 802.3 Ethernet working group, to align open standards, incorporate best practices, and accelerate innovation, the OCP stated.
Data center networking milestones
The launch of ESUN is just one of the AI networking developments Meta shared at the event. Meta engineers also announced three data center networking innovations aimed at making its infrastructure more flexible, scalable, and efficient:
- The evolution of Meta’s Disaggregated Scheduled Fabric (DSF) to support scale-out interconnect for large AI clusters that span entire data center buildings.
- A new Non-Scheduled Fabric (NSF) architecture based entirely on shallow-buffer, disaggregated Ethernet switches that will support Meta’s largest AI clusters, such as Prometheus.
- The addition of Minipack3N, based on Nvidia’s Spectrum-4 Ethernet ASIC, to Meta’s portfolio of 51Tbps OCP switches that use OCP’s Switch Abstraction Interface (SAI) and Meta’s Facebook Open Switching System (FBOSS) software stack.
DSF is Meta’s open networking fabric that completely separates switch hardware, NICs, endpoints, and other networking components from the underlying network, using OCP-SAI and FBOSS to do so, according to Meta. It supports RDMA over Converged Ethernet (RoCE) to endpoints, accelerators and NICs from multiple vendors, such as Nvidia, AMD and Broadcom, as well as Meta’s own MTIA accelerator stack. Between endpoints, it uses scheduled fabric techniques, particularly Virtual Output Queuing (VOQ) for traffic scheduling, to proactively avoid congestion rather than just react to it, according to Meta.
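To make the idea of proactive scheduling with virtual output queues more concrete, here is a minimal Python sketch. It is not Meta's implementation and does not reflect FBOSS or any switch ASIC: the class names (IngressPort, EgressScheduler), the run_tick helper, the per-tick capacity and the round-robin credit policy are all assumptions chosen for illustration. The sketch shows two things: each ingress keeps a separate queue per egress, and a packet is sent only after the egress has granted a credit for it.

```python
from collections import defaultdict, deque


class IngressPort:
    """Hypothetical ingress port holding one virtual output queue per egress."""

    def __init__(self, name):
        self.name = name
        self.voqs = defaultdict(deque)  # egress name -> packets waiting for it

    def enqueue(self, egress, packet):
        self.voqs[egress].append(packet)

    def pending(self, egress):
        return len(self.voqs[egress])

    def send(self, egress):
        return self.voqs[egress].popleft()


class EgressScheduler:
    """Hypothetical egress port that hands out at most capacity_per_tick credits."""

    def __init__(self, name, capacity_per_tick):
        self.name = name
        self.capacity_per_tick = capacity_per_tick
        self._next = 0  # round-robin pointer over ingress ports

    def grant(self, ingresses):
        """Return the ingress ports allowed to send one packet each this tick."""
        remaining = [p.pending(self.name) for p in ingresses]
        grants, credits, idle = [], self.capacity_per_tick, 0
        while credits > 0 and idle < len(ingresses):
            i = self._next
            self._next = (self._next + 1) % len(ingresses)
            if remaining[i] > 0:
                remaining[i] -= 1
                credits -= 1
                grants.append(ingresses[i])
                idle = 0
            else:
                idle += 1  # stop once every ingress has been found empty
        return grants


def run_tick(ingresses, egresses):
    """One scheduling round: packets move only against credits from the egress."""
    delivered = {}
    for egress in egresses:
        granted = egress.grant(ingresses)
        delivered[egress.name] = [p.send(egress.name) for p in granted]
    return delivered


if __name__ == "__main__":
    a, b = IngressPort("A"), IngressPort("B")
    x, y = EgressScheduler("X", 1), EgressScheduler("Y", 1)
    for n in range(3):
        a.enqueue("X", f"a-x{n}")   # A has a backlog toward congested egress X
    a.enqueue("Y", "a-y0")          # arrives last, but is bound for idle egress Y
    b.enqueue("X", "b-x0")
    print(run_tick([a, b], [x, y]))
    print(run_tick([a, b], [x, y]))
```

In this toy run, the packet bound for egress Y leaves in the first tick even though it arrived behind three packets bound for X, because it sits in its own queue. Egress X never receives more than one packet per tick, since senders wait for a credit instead of transmitting and then reacting to drops or congestion signals.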
“Over the last year, we have evolved DSF to a 2-stage architecture, scaling to support a non-blocking fabric that interconnects up to 18,432 XPUs,” wrote a group of Meta engineers in a co-authored blog post about the new advances. “These clusters are a fundamental building block for constructing AI clusters that span regions (and even multiple regions) in order to meet the increased capacity and performance demands of Meta’s AI workloads.”