The Split Path of SDN and RDMA Networks
Over the past decade, Software-Defined Networking (SDN) emerged as a game-changer, becoming the go-to standard for data center networking. Its rise aligned with the needs of cloud-scale operations, offering substantial gains in efficiency. SDN delivered two key benefits: enhanced control and greater flexibility. By separating network control from data forwarding, it allowed operators to program the network dynamically while abstracting the underlying hardware from applications and services. This programmability sharpened control, and the abstraction unlocked flexible implementation, driving faster feature rollouts and more predictable management in data centers.
Meanwhile, Remote Direct Memory Access (RDMA) networks carved out a niche in smaller-scale deployments, primarily in storage and high-performance computing (HPC). These setups were modest compared to the sprawling infrastructures where SDN thrived. Early attempts to scale RDMA for cloud environments—seen in pioneers like Microsoft Azure and Amazon Web Services (AWS)—hit roadblocks tied to control and flexibility, though each tackled the hurdles differently, as we’ll explore later.
Now, in the current decade, the explosive growth of AI—particularly GPU-driven backend or scale-out networking—has thrust RDMA into the spotlight. The surge in Generative AI (GenAI) and Large Language Models (LLMs) has unleashed new demands: higher performance, scalability, and unique traffic patterns. LLM training, for instance, generates sporadic data bursts with low entropy and intense network use, maxing out modern RDMA NICs at 400 Gbps. These patterns throw off traditional Equal-Cost Multi-Path (ECMP) load balancing, a staple of SDN-based data centers: with only a handful of large, long-lived flows, the per-flow hash can land several of them on the same link while others sit idle, pushing operators to rethink their designs.
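The ECMP problem above can be illustrated with a minimal sketch. The hash function, addresses, and four-path fabric below are all illustrative assumptions (real switches use vendor-specific hashes over the packet 5-tuple); the point is only that a low-entropy workload—a few elephant flows, as in LLM training—gives the hash too few distinct inputs to spread load evenly.

```python
import hashlib
from collections import Counter

def ecmp_path(flow, num_paths):
    """Pick an egress path by hashing the flow 5-tuple, as a typical
    ECMP switch does. MD5 here is an illustrative stand-in for a
    vendor-specific hardware hash."""
    key = "|".join(map(str, flow)).encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# Hypothetical scenario: four long-lived RDMA "elephant" flows
# between GPU servers, over a fabric with four equal-cost paths.
# (RoCEv2 traffic runs over UDP destination port 4791.)
flows = [
    ("10.0.0.1", "10.0.1.1", 49152, 4791, "UDP"),
    ("10.0.0.2", "10.0.1.2", 49153, 4791, "UDP"),
    ("10.0.0.3", "10.0.1.3", 49154, 4791, "UDP"),
    ("10.0.0.4", "10.0.1.4", 49155, 4791, "UDP"),
]
loads = Counter(ecmp_path(f, num_paths=4) for f in flows)
print(dict(loads))
```

With thousands of short flows, the hash statistically evens out; with four flows at 400 Gbps each, a single collision halves the effective bandwidth of one link, which is why these traffic patterns force a rethink of hash-based load balancing.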