In 2024, the data center network landscape underwent rapid transformation, driven by the demands of generative artificial intelligence (AI) workloads. From the initial spotlight on ChatGPT to models like Gemini and Grok, and Chinese offerings such as DeepSeek, Doubao, and Tongyi Qianwen, large-scale AI model development has surged, pushing market demand for AI-specific data center computing power to unprecedented heights. Major players have ramped up investments to align with this trend:

  • Microsoft plans to invest $80 billion in 2025 to build data centers, primarily for AI applications.
  • Meta is committing $60 billion to $65 billion, mainly for data centers and servers, representing a 60% to 70% increase over 2024.
  • AWS has pledged $11 billion to support AI and cloud technology infrastructure, including launching Project Rainier to create a supercluster with hundreds of thousands of Trainium chips to serve clients like Anthropic.

These massive investments are fundamentally reshaping traditional data center network architectures, as the demands of AI training and inference push networks to new performance limits. The industry is innovating with Scale Up and Scale Out network solutions to meet these challenges.

At the chip level, companies like Broadcom and Marvell provide foundational technologies for these connections. Within racks, NVIDIA’s proprietary NVLink interconnect protocol competes with emerging open standards like UALink, while at the Scale Out level, InfiniBand and Ethernet solutions continue to evolve to meet AI workload needs. The Ultra Ethernet Consortium (UEC) and its UET protocol signal a strong industry push toward open standards.

Moreover, a key architectural shift is emerging: the basic unit of AI computing is moving from individual servers to integrated rack-scale systems, exemplified by NVIDIA’s GB200 NVL72 platform and AWS’s Trainium2 UltraServer. The network vendor landscape is also rapidly evolving, with companies racing to develop new architectures optimized for AI workloads.


Key Technologies in Data Center Networks

Scale Up Networks

Chip-to-Chip Communication: UCIe and Chiplets

The Open Compute Project (OCP), through its Open Domain-Specific Architecture (ODSA), continues to drive an open chiplet ecosystem. As competition intensifies among GPUs and AI accelerators, and chip release cycles shrink from 18-24 months to 12 months, the time-to-market advantage of chiplets becomes increasingly appealing for AI Systems-on-Chip (SoCs).

UCIe (Universal Chiplet Interconnect Express) is an open industry standard defining interconnects between chiplets within a package, enabling a modular approach to system-level chip design. Unlike CXL, which focuses on processor-to-device connections on a motherboard, UCIe targets chip-to-chip communication within a single package. Supported by companies like AMD, Arm, Intel, Qualcomm, Samsung, and TSMC, UCIe reflects broad industry backing for an open chiplet ecosystem.

Key features of UCIe enable mixing and matching chiplets from different vendors while maintaining high performance:

  • Data rates up to 32 Gbps per pin.
  • Compatibility with standard and advanced packaging technologies.
  • Leveraging existing PCIe and CXL protocols.
  • Dedicated layers for chip-to-chip adaptation and physical connectivity.

PCIe

Standards-based PCIe has made significant strides in meeting AI system demands. PCIe 6.0 supports up to 256 GB/s of bidirectional bandwidth over a 16-lane (x16) link, while PCIe 7.0, whose version 0.7 specification was released in January 2025, targets an impressive 512 GB/s. PCIe’s primary advantage remains its standards-based approach, enabling interoperability across a diverse chip ecosystem.
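
As a rough sanity check of these figures, the headline x16 numbers follow directly from the per-lane raw rates (64 GT/s for PCIe 6.0, 128 GT/s for PCIe 7.0) when both directions are counted; the snippet below is a back-of-the-envelope sketch that ignores FLIT and encoding overhead, not a spec-level accounting.

```python
# Back-of-the-envelope PCIe x16 bandwidth check (FLIT/encoding overhead ignored).
# Per-lane raw rates: PCIe 6.0 = 64 GT/s, PCIe 7.0 = 128 GT/s (~1 bit per transfer).
PER_LANE_GBPS = {"PCIe 6.0": 64, "PCIe 7.0": 128}
LANES = 16  # x16 link

for gen, lane_gbps in PER_LANE_GBPS.items():
    per_direction_gb_s = lane_gbps * LANES / 8      # GB/s in one direction
    bidirectional_gb_s = per_direction_gb_s * 2     # GB/s counting both directions
    print(f"{gen}: ~{per_direction_gb_s:.0f} GB/s per direction, "
          f"~{bidirectional_gb_s:.0f} GB/s bidirectional")
# PCIe 6.0: ~128 GB/s per direction, ~256 GB/s bidirectional
# PCIe 7.0: ~256 GB/s per direction, ~512 GB/s bidirectional
```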

However, this evolution comes with challenges. As PCIe versions advance, maximum transmission distances shrink. To address this, PCIe retimers from vendors like Marvell become critical, extending reach and enabling connections between CPUs, GPUs, and I/O devices on servers.

CXL

Compute Express Link (CXL) is a forward-looking open standard suited for fine-grained connectivity within xPU server clusters and nodes. Built on PCIe 5.0/6.0 standards, CXL adds cache coherency, allowing xPUs to share a common memory pool with synchronized states.

The CXL architecture is built on three core protocols:

  • CXL.io: The foundational protocol handling device initialization, discovery, and basic I/O operations.
  • CXL.cache: Enables cache-coherent communication between host and device memory via ultra-low-latency request-response mechanisms.
  • CXL.mem: Allows host processors to directly access device memory through load/store commands, supporting both volatile and persistent memory.

PCIe switches and CXL show immense potential in enabling composable computing architectures (e.g., shared memory pools, shared storage, reconfigurable setups). However, as focus shifts to large-scale AI training clusters, CXL’s use is largely limited to shared memory access, with vendors increasingly betting on alternatives like UALink (for xPU-to-xPU links).

NVIDIA NVLink

Introduced in 2014, NVLink is NVIDIA’s proprietary high-speed, low-latency interconnect technology designed as an alternative to PCIe.

In 2024, NVLink saw significant performance gains. The fifth-generation NVLink doubled its predecessor’s 900 GB/s of per-GPU throughput to 1.8 TB/s. NVIDIA also introduced the NVLink Switch, a groundbreaking architecture enabling full-speed bidirectional connectivity between up to 576 GPUs.

UALink

UALink represents a major industry push for a high-speed, low-latency chip-to-chip interconnect standard tailored for AI and high-performance computing (HPC) accelerators. Formed in May 2024, the consortium includes key players like AMD, Intel, Google, Cisco, and Broadcom. To accelerate development, AMD contributed its Infinity Fabric shared-memory protocol and xGMI GPU-to-GPU interface to UALink, with members agreeing to adopt Infinity Fabric as the standard protocol for accelerator interconnects.

UALink offers several architectural advantages:

  • Scalability, supporting up to 1,024 accelerators in a single AI cluster.
  • Competitive high-bandwidth and low-latency performance.
  • 40% improved energy efficiency.
  • Support for AI training and inference solutions.

Reports suggest multiple semiconductor firms are developing Ultra Accelerator Link switches. Meanwhile, Synopsys announced the first UALink IP solution, offering 200 Gbps per lane and supporting up to 1,024 accelerators, slated for release in late 2025.

Compared to UALink, NVIDIA’s NVLink retains advantages in maturity and deployment experience. NVIDIA CEO Jensen Huang noted that by the time UALink achieves commercial adoption, NVLink may have advanced to even higher performance levels.


Scale Out Networks

Today, many AI networks adopt a disaggregated architecture with separate front-end and back-end networks. The front-end network uses simple 100/200 Gbps Ethernet with a standard two- or three-tier Clos topology, connecting xPU clusters to external systems like applications and storage. The back-end network operates at higher speeds (400/800 Gbps), supporting the intensive data transfers required during AI training or computation tasks.
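
To make the back-end scale concrete, the sketch below sizes a generic non-blocking two-tier leaf-spine Clos fabric from the switch radix alone. The radix values are illustrative assumptions (for example, a 64-port 800G switch), not a description of any specific vendor's product.

```python
# Sizing a non-blocking two-tier (leaf-spine) Clos fabric from switch radix.
# Assumption: each leaf splits its ports evenly between host downlinks and spine uplinks.

def two_tier_clos_capacity(radix: int) -> dict:
    downlinks_per_leaf = radix // 2          # ports facing GPUs/hosts
    uplinks_per_leaf = radix - downlinks_per_leaf
    max_spines = uplinks_per_leaf            # one uplink from each leaf to every spine
    max_leaves = radix                       # each spine port connects one leaf
    max_hosts = max_leaves * downlinks_per_leaf
    return {"leaves": max_leaves, "spines": max_spines, "hosts": max_hosts}

# Illustrative radices: 64 ports (e.g., 64 x 800G on a 51.2 Tb/s switch) and 128 ports.
for radix in (64, 128):
    cap = two_tier_clos_capacity(radix)
    print(f"radix {radix}: up to {cap['hosts']} endpoints "
          f"({cap['leaves']} leaves, {cap['spines']} spines)")
# radix 64: up to 2048 endpoints (64 leaves, 32 spines)
# radix 128: up to 8192 endpoints (128 leaves, 64 spines)
```

Clusters beyond that size add a third tier or, as discussed below, restructure the topology entirely.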

Whether using InfiniBand or Ethernet, back-end networks rely on RDMA (Remote Direct Memory Access) protocols for performance optimization. RDMA allows GPU nodes to read and write to each other’s memory without CPU involvement. This direct memory access is critical for AI workloads, reducing latency and CPU overhead. While RDMA was originally developed for InfiniBand, RoCE (RDMA over Converged Ethernet) has gained widespread support across vendors.
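
The benefit of bypassing the remote CPU can be illustrated with a deliberately simplified toy model: the helper below writes straight into another node's registered buffer without invoking any handler on that node, which is the essence of a one-sided RDMA write. This is a conceptual sketch only, not real verbs or RoCE code.

```python
# Toy model contrasting a CPU-mediated receive with a one-sided RDMA-style write.

class Node:
    def __init__(self, name: str, mem_size: int = 64):
        self.name = name
        self.memory = bytearray(mem_size)   # stands in for a registered memory region
        self.cpu_interrupts = 0             # counts receive-path CPU involvement

    def cpu_receive(self, offset: int, data: bytes):
        """Socket-style receive: the remote CPU copies the data into place."""
        self.cpu_interrupts += 1
        self.memory[offset:offset + len(data)] = data

def rdma_write(target: Node, offset: int, data: bytes):
    """One-sided write: the initiator's NIC places data directly in target memory;
    the target's CPU sees no interrupt and performs no copy."""
    target.memory[offset:offset + len(data)] = data

gpu_b = Node("gpu-b")
gpu_b.cpu_receive(0, b"gradient-chunk-1")   # classic path: remote CPU does the copy
rdma_write(gpu_b, 16, b"gradient-chunk-2")  # RDMA path: remote CPU is untouched
print(gpu_b.cpu_interrupts)                 # 1 -> only the non-RDMA transfer cost CPU work
```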

The back-end network is expected to grow significantly. The 650 Group predicts RDMA-related revenue will rise from $6.9 billion in 2023 to $22.5 billion by 2028. Dell’Oro forecasts that data center switches for AI back-end networks will drive nearly $80 billion in spending over the next five years.

InfiniBand dominates HPC workloads requiring high bandwidth and low latency, but Ethernet with RoCEv2 is gaining momentum due to:

  • Lower costs (estimated 40-50% cheaper than InfiniBand).
  • Greater familiarity among network engineers.
  • A broad ecosystem of tools and expertise.

Notable deployments highlight Ethernet’s growing viability for AI workloads:

  • Meta, despite owning InfiniBand-based AI clusters, used an Ethernet-based cluster to train its openly released Llama models.
  • xAI’s Colossus supercomputer adopted NVIDIA’s Ethernet-based Spectrum-X architecture across its 100,000 GPUs.

With increasing deployments and UEC’s efforts, the gap between Ethernet and InfiniBand may close rapidly.

InfiniBand

Through its acquisition of Mellanox, NVIDIA remains the leading provider of data center-scale InfiniBand solutions. InfiniBand has proven itself in HPC environments and established its role as a high-performance network for AI training.

In 2024, InfiniBand standards advanced significantly. The InfiniBand Trade Association (IBTA) released Volume 1 Specification 1.8 in September, greatly enhancing RDMA capabilities. This spec introduced XDR (Extended Data Rate), boosting per-lane speeds to around 200 Gbps, while improving reliability with XDR FEC (Forward Error Correction). It also expanded support for next-gen interfaces like 4-lane QSFP 800 Gbps and 8-lane QSFP-DD/OSFP 1600 Gbps. Additionally, it bolstered security for RDMA networks in data-intensive settings, improved congestion management, and enabled switches with up to 256 ports, fostering higher-radix switch development.

Ethernet and RoCE

Ethernet has made notable strides in supporting AI workloads. The IBTA enhanced RoCEv2 interoperability with InfiniBand, while the OpenFabrics Alliance improved RoCE support, further reducing latency and boosting data transfer speeds. Combined with Linux kernel enhancements, these improvements significantly elevate RoCEv2 performance on Linux systems.

Most Ethernet fabrics for AI training support RoCEv2 and implement additional scheduling and load-balancing features. A key differentiator among vendors is their approach to intelligent congestion control to prevent packet loss and reduce latency (including tail latency). Vendors have developed various strategies to enhance Scale Out architectures:

  • Endpoint- and Notification-Based Approaches: Focus on mitigating congestion after it occurs. These systems use Priority Flow Control (PFC), sending pause messages to source nodes to slow incoming data flows when receiver-side queues hit depth thresholds.
  • Multipath Approaches: Take preventive measures via Equal-Cost Multipath (ECMP), identifying paths with equal routing metrics and hashing flows across them for load balancing (contrasted with packet-level spraying in the sketch after this list).

While DPUs or SmartNICs can implement these first two strategies, they still struggle with the incast traffic patterns common in AI training workloads. More fabric-level approaches address this:

  • Scheduling-Based Solutions: Adopt a holistic approach, preventing congestion through end-to-end flow scheduling from input to output ports. These systems deliver deterministic latency and throughput, eliminate packet loss, and reduce jitter. Some vendors enhance this with packet-spraying techniques, distributing traffic at the packet level rather than the flow level for finer-grained load balancing than traditional flow-based methods.
  • Virtual Chassis Approaches: Treat multiple switches as part of a single logical chassis, aiding coordination across the network fabric. Vendors like Arista, Arrcus, and DriveNets have implemented such architectures, where switches interconnect to operate as a single logical switch or router. These solutions use a centralized control plane for consistent routing, scheduling, and management across all network nodes. They are designed for resilient scaling and leverage advanced scheduling and load-balancing techniques to prevent congestion and optimize utilization.
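
The difference between flow-level ECMP hashing and packet-level spraying can be shown in a few lines of Python. The path count and 5-tuple below are illustrative assumptions; real switches compute these hashes in hardware and typically add congestion awareness on top.

```python
import hashlib
from collections import Counter

PATHS = 4  # equal-cost uplinks between a leaf and its spines (illustrative)

def ecmp_path(five_tuple: tuple) -> int:
    """Flow-level ECMP: every packet of a flow hashes to the same path."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return digest[0] % PATHS

def sprayed_path(packet_seq: int) -> int:
    """Packet-level spraying: consecutive packets rotate across all paths."""
    return packet_seq % PATHS

flow = ("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP")  # RoCEv2-style 5-tuple (illustrative)
ecmp_load = Counter(ecmp_path(flow) for _ in range(1000))
spray_load = Counter(sprayed_path(seq) for seq in range(1000))
print("ECMP :", dict(ecmp_load))   # one elephant flow pins a single path
print("Spray:", dict(spray_load))  # the same 1,000 packets spread evenly over all paths
```

The trade-off is that spraying can deliver packets out of order, which is why it is usually paired with NICs or transports that tolerate reordering.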

Other technologies and topologies are also used for back-end xPU networks. For instance, MIT and Meta researchers proposed a Rail-Only network that eliminates spine switches and leverages the high-bandwidth interconnects within nodes. Their approach stems from the observation that foundation model (FM) training traffic is sparse and largely stays within “rails” (i.e., GPUs of the same local rank across nodes). For traffic that must cross rails, such as in mixture-of-experts models where each expert communicates with other parts of the model, packets are first forwarded over the high-bandwidth intra-node interconnect (e.g., NVLink) to the destination rail. This architecture showed significant efficiency gains, cutting network costs by 38% to 77% and power by 37% to 75% compared to existing state-of-the-art designs.
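
A minimal sketch of the rail idea, assuming eight GPUs per node: GPUs that share a local rank across nodes form a rail and connect to the same rail switch, while traffic between different ranks first hops across the intra-node interconnect to reach the destination rail. This is an illustrative model of the paper's observation, not its implementation.

```python
# Rail-Only routing sketch: (node, local_rank) identifies a GPU, and GPUs sharing
# a local_rank across nodes form one "rail" served by a single rail switch.

def route(src: tuple, dst: tuple) -> str:
    src_node, src_rank = src
    dst_node, dst_rank = dst
    if src_node == dst_node:
        return "intra-node interconnect (e.g., NVLink)"
    if src_rank == dst_rank:
        return f"rail switch {src_rank} (traffic stays within the rail)"
    # Cross-rail and cross-node: hop to the destination rail inside the source node
    # first, then traverse that rail's switch -- no spine layer is required.
    return f"intra-node hop {src_rank}->{dst_rank}, then rail switch {dst_rank}"

print(route((0, 3), (5, 3)))  # same rank, different node -> a single rail switch
print(route((0, 3), (5, 6)))  # different rank -> intra-node hop, then rail 6
```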

UEC and UET

The Ultra Ethernet Consortium (UEC) is a significant industry initiative to create an open Ethernet transport protocol optimized for AI and HPC workloads. Since its inception, the consortium has grown rapidly, now encompassing hundreds of members across multiple sectors:

  • Semiconductor companies: NVIDIA, AMD, Intel, Broadcom.
  • Network equipment providers: Cisco, Arista, Juniper, Nokia.
  • System manufacturers: Dell, Huawei, Lenovo, HPE.
  • Hyperscale cloud providers: Meta, Microsoft, Alibaba, Baidu, Tencent.

In its 2023 whitepaper, the UEC announced the development of the Ultra Ethernet Transport (UET) protocol, intended to eventually replace RoCE as the open Ethernet transport for AI and HPC workloads. UET’s design goals are comprehensive, addressing limitations in existing protocols and preparing for future scaling challenges.

Key technical goals of UET include:

  • Scalability: Supporting up to 1 million connected endpoints.
  • Data speed: Achieving up to 1.6 Tbps transfer rates.
  • Embedded security: Providing security features at the transport layer.
  • Minimized connection setup time: Reducing the time needed to establish connections.
  • Reduced connection state overhead: Lowering resource demands for connection state management.

In 2024, the UEC made significant progress, and as of January 2025, it is finalizing the version 1.0 specification. Key member vendors are preparing to launch UET-supporting NICs and switches in sync with the spec release. The protocol introduces innovative features distinguishing it from existing solutions:

  • Multipath capabilities: Enhancing reliability and performance through multipathing.
  • Advanced packet-spraying techniques: Optimizing resource utilization.
  • Flexible delivery ordering: Eliminating the need for packet reordering before delivery.
  • Real-time telemetry-based automated congestion control: Adjusting strategies based on live data.
  • Built-in security features: Specifying authentication, authorization, and confidentiality without compromising performance.

Scale Out/Scale Outside: Front-End Networks

Front-end networks use traditional two- or three-tier Clos Ethernet architectures, but the rise of inference workloads is driving new demands. With multimodal inference gaining traction, north-south traffic in AI clusters is increasing, requiring higher data rates (100-400 Gbps) and stricter end-to-end QoS guarantees.

Security has become a critical concern for these networks. Modern front-end architectures incorporate multiple layers of protection:

  • Dynamic data encryption.
  • Least privilege/zero-trust frameworks.
  • Role-based network access control.
  • Intelligent firewall capabilities.

Segment Routing over IPv6 (SRv6) has emerged as a key technology for enhancing front-end network performance. SRv6 embeds QoS prioritization and fine-grained traffic steering directly into IPv6 packet headers, offering greater flexibility than MPLS. This approach enables rich orchestration while reducing control plane overhead.
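
Conceptually, SRv6 encodes the desired path as an ordered list of segment identifiers (SIDs, which are IPv6 addresses) carried in a segment routing header; each segment endpoint decrements Segments Left and swaps the next SID into the destination field. The sketch below models that behavior in plain Python as a conceptual aid; it is not a packet-level implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SRv6Packet:
    """Simplified view of an IPv6 packet carrying a segment routing header (SRH)."""
    destination: str                                   # current IPv6 destination address
    segment_list: list = field(default_factory=list)   # SIDs stored last-segment-first, as on the wire
    segments_left: int = 0

def steer(path_sids: list) -> SRv6Packet:
    """Ingress behavior: encode the path; the first SID becomes the destination."""
    return SRv6Packet(destination=path_sids[0],
                      segment_list=list(reversed(path_sids)),
                      segments_left=len(path_sids) - 1)

def endpoint(pkt: SRv6Packet) -> SRv6Packet:
    """'End' behavior at a segment endpoint: advance to the next SID."""
    if pkt.segments_left > 0:
        pkt.segments_left -= 1
        pkt.destination = pkt.segment_list[pkt.segments_left]
    return pkt

# Steer a flow through two intermediate segments before its final destination.
pkt = steer(["fc00:1::1", "fc00:2::1", "2001:db8::10"])
while pkt.segments_left > 0:
    pkt = endpoint(pkt)
print(pkt.destination)  # 2001:db8::10 -- reached after traversing both segments
```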

Major network vendors like Cisco, Juniper, Arista, Nokia, and Arrcus have adopted SRv6. For example, Arrcus leverages SRv6 to achieve end-to-end QoS across multiple network domains.


Scale Outside: Data Center Interconnect

At its 2024 Networking@Scale conference, Meta revealed that AI model training’s impact on backbone networks has exceeded initial forecasts. Major operators like Lumen and Zayo confirmed this trend in investor reports, emphasizing the importance of robust data center interconnects (DCI).

To meet these needs, optical interconnect technology is advancing rapidly. The industry has widely adopted 400ZR/ZR+ modules, tailored to different use cases:

  • Standard ZR modules deliver 400 Gbps over distances up to 120 km.
  • ZR+ variants, via OpenZR+ and OpenROADM standards, support flexible modulation and extend reach to 400 km.

The next frontier in DCI technology is 800ZR/ZR+. The Optical Internetworking Forum (OIF) released the 800ZR Implementation Agreement in October 2024, marking the arrival of this new standard. It brings significant advancements:

  • Using 16QAM modulation to deliver 800 Gbps rates over distances up to 520 km (a rough symbol-rate check follows this list).
  • ZR+ variants extending reach beyond 1,000 km.
  • 30% lower power per bit compared to 400ZR.
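
As a rough sanity check of the 16QAM figure, dual-polarization 16QAM carries 8 bits per symbol, so 800 Gb/s of payload implies a symbol rate on the order of 100 GBd before FEC and framing are added; shipping modules run somewhat faster to carry that overhead. The arithmetic below is approximate by design.

```python
# Rough symbol-rate estimate for 800ZR-class DP-16QAM (FEC/framing overhead ignored).
payload_gbps = 800
bits_per_symbol = 4 * 2            # log2(16) bits per polarization x 2 polarizations
symbol_rate_gbaud = payload_gbps / bits_per_symbol
print(f"~{symbol_rate_gbaud:.0f} GBd before overhead")   # ~100 GBd
# For comparison, 400ZR works out to ~50 GBd before overhead (about 60 GBd on the wire
# once CFEC and framing are included).
```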

These developments are critical for AI workloads, with 800ZR+ device shipments expected to grow rapidly. The technology’s improved energy efficiency is especially valuable for power-constrained data centers supporting AI infrastructure.


Other Data Center Considerations

Beyond networking and interconnects, additional factors shape modern data center architectures. SmartNICs and DPUs play an increasingly vital role in fabric management and policy enforcement, particularly for security and isolation.

Security and DPUs

AI front-end and cloud data center networks share common security needs, especially in network segmentation. These include:

  • Robust tenant isolation.
  • Preventing unauthorized lateral data movement.
  • Application-based network policy enforcement.

In modern architectures, DPUs and SmartNICs serve dual purposes. Beyond offloading RDMA functions (e.g., segmentation, reassembly, and advanced congestion control), they provide sophisticated security features:

  • Hardware-accelerated encryption/decryption.
  • Secure key management.
  • Advanced firewalls.
  • Granular microsegmentation (a minimal policy sketch follows this list).
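
As a minimal illustration of microsegmentation enforced in the DPU or SmartNIC datapath, the sketch below checks every flow against an explicit allowlist keyed on tenant and application tags and drops everything else (default deny). The tenants, tags, and rules are hypothetical.

```python
# Default-deny microsegmentation check, as a DPU/SmartNIC might apply it per flow.
# Tenants, tags, and rules are hypothetical examples.

ALLOW_RULES = {
    # (src_tenant, src_tag, dst_tenant, dst_tag, dst_port)
    ("tenant-a", "web", "tenant-a", "app", 8443),
    ("tenant-a", "app", "tenant-a", "db", 5432),
}

def permit(src_tenant: str, src_tag: str, dst_tenant: str, dst_tag: str, dst_port: int) -> bool:
    """Allow only explicitly whitelisted tenant/tag/port combinations."""
    return (src_tenant, src_tag, dst_tenant, dst_tag, dst_port) in ALLOW_RULES

print(permit("tenant-a", "web", "tenant-a", "app", 8443))  # True  -- allowed tier-to-tier flow
print(permit("tenant-a", "web", "tenant-a", "db", 5432))   # False -- blocks lateral movement
print(permit("tenant-b", "web", "tenant-a", "app", 8443))  # False -- enforces tenant isolation
```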

AWS’s Nitro architecture exemplifies this approach, offloading virtualization, networking, and security functions to dedicated hardware to provide a secure foundation for cloud services. Following this model, NVIDIA (via BlueField DPUs) and AMD (via Pensando DPUs) are enhancing their enterprise data center offerings. Google employs a similar strategy, pairing its Titanium offload architecture with custom security microcontrollers for analogous security and offload functions.

Open Source and SONiC

SONiC continues to gain traction in data center networking, expanding from hyperscale data centers to enterprise deployments. Major vendors are now positioning SONiC as a solution for AI workloads, with several notable implementations:

  • Arista: An early SONiC contributor, Arista enables the platform on its 7050X and 7060X series switches, offering customers flexibility in network OS choices.
  • Cisco: Partnering with startup Aviz Networks, Cisco provides enterprise-grade SONiC support on its 8000-series routers, leveraging Silicon One ASICs for high-performance 400G networking.
  • Juniper: Offers SONiC as an option on its QFX and PTX platforms, integrating it with the Apstra automation platform for enhanced management.
  • Nokia: Adopts a dual strategy, supporting both community SONiC and its proprietary SR Linux. Notably, Nokia aided Microsoft Azure’s transition from 100G to 400G infrastructure.
  • NVIDIA: A key SONiC proponent, NVIDIA enables it on its Spectrum open Ethernet switches and Spectrum-X platform. As a SONiC governance committee member and major contributor, NVIDIA is shaping the platform’s future.

This growing adoption reflects an industry shift toward open networking solutions that reduce vendor lock-in while maintaining enterprise-grade reliability. This trend is particularly relevant for AI-focused data centers, where flexibility, scalability, and cost-efficiency are paramount.

CPO, LRO, and LPO: Transforming Data Center Interconnects

As AI workloads drive bandwidth demand, alongside concerns over power and cost, innovations in optical interconnect technologies are accelerating. Three approaches have emerged, each offering unique solutions to these challenges:

  • Co-Packaged Optics (CPO): Integrates optical engines directly onto the same substrate as the switch chip. This tight integration reduces signal loss by shortening electrical paths and lowers power via low-power SerDes. CPO shows immense potential in AI data centers, where enhanced system efficiency and scalability are critical. However, adoption faces challenges, including technology maturity and complex business model issues.
  • Linear Receive Optics (LRO): Takes a more incremental approach, retaining DSPs (Digital Signal Processors) in the transmit path but removing them from the receive path. This hybrid strategy blends analog and digital processing to cut power while maintaining interoperability with industry standards. LRO serves as a transitional technology between legacy optical modules and more advanced solutions, balancing performance and power efficiency.
  • Linear Pluggable Optics (LPO): Pursues simplification by removing traditional DSPs and CDRs (Clock and Data Recovery) chips. This direct-drive approach reduces power and latency while retaining hot-pluggability. Proven suitable for short-reach applications within data centers, LPO offers an ideal blend of performance and cost-effectiveness.

Vendor Landscape

Network Vendors

Arista

Arista’s AI networking portfolio centers on its 7700R4 Distributed Etherlink Switch (DES) architecture, which transcends traditional chassis limitations while maintaining predictable performance. With a distributed scheduler design, it supports over 27,000 800GbE ports or 31,000 400GbE ports, creating a single logical switching domain. This architecture combines four key technologies:

  • Cell-based traffic dispersion for even load distribution.
  • Virtual Output Queuing (VOQ) to prevent head-of-line blocking (illustrated in the sketch after this list).
  • Distributed credit scheduling to eliminate “noisy neighbor” effects.
  • Deep buffering to handle microburst traffic.
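
Head-of-line blocking, and how VOQ avoids it, can be shown with a toy ingress-queue model: with a single FIFO, one packet stuck behind a congested output stalls traffic headed to idle outputs, whereas per-output virtual queues let unaffected traffic drain. This is a conceptual illustration, not a model of Arista's scheduler.

```python
from collections import deque, defaultdict

packets = [("out1", "p1"), ("out2", "p2"), ("out3", "p3")]  # arrivals at one input port
busy_outputs = {"out1"}                                     # out1 is currently congested

# Single FIFO: p1 (headed to busy out1) blocks p2 and p3 queued behind it.
fifo = deque(packets)
sent_fifo = []
while fifo and fifo[0][0] not in busy_outputs:
    sent_fifo.append(fifo.popleft())
print("FIFO sent:", sent_fifo)  # [] -- everything is stuck behind p1

# VOQ: one virtual queue per output, so only the queue for out1 has to wait.
voqs = defaultdict(deque)
for out, pkt in packets:
    voqs[out].append(pkt)
sent_voq = [voqs[out].popleft() for out in voqs if out not in busy_outputs]
print("VOQ sent:", sent_voq)    # ['p2', 'p3'] -- only out1's queue waits
```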

Arista’s CloudVision platform and AI Analyzer provide microsecond-level visibility and management. Its EOS (Extensible Operating System) ensures consistent operations with features like “smart system upgrades” for seamless software updates during long-running AI training jobs. The platform also supports linear pluggable optics (LPO) to reduce energy use compared to traditional DSP-based optical modules.

Cisco

Cisco tackles AI networking with a three-pronged approach for varying deployment scales. Its Nexus HyperFabric AI, built around the 6000-series switches, offers a cloud-managed solution with automated design and deployment. Built on an EVPN-VXLAN fabric, it supports speeds from 10 Gbps to 400 Gbps, with the Nexus Dashboard providing comprehensive telemetry and automation.

The Nexus 9000 series, notably the 9364E-SG2 switch, forms the backbone for mainstream AI deployments, supporting speeds from 100 Gbps to 800 Gbps. It delivers key AI/ML capabilities like dynamic load balancing, PFC, and DCQCN.

For hyperscale deployments, Cisco’s 8122-64EH router leverages its Silicon One G200 processor—a 5nm, 51.2T-capable chip delivering 64 800G ports in a 2RU form factor. Advanced features include improved flow control, congestion awareness, hardware-based link failure recovery, and sophisticated packet-spraying capabilities.

Juniper

Juniper meets AI networking needs with standards-based Ethernet solutions, using its PTX and QFX switches for leaf-spine connectivity. At maximum configuration, this architecture delivers up to 460.8 Tbps throughput with 576 800GbE ports, connecting over 18,000 GPUs in a two-tier Clos network. Its strategy blends custom silicon (Juniper Express 5) and merchant silicon (Broadcom Tomahawk), offering architectural flexibility.

The platform combines ECN, PFC, and DCQCN for congestion management, ensuring lossless transmission.

To simplify operations, Juniper introduced Apstra, an intent-based networking software. Apstra automates the network lifecycle from initial deployment (Day 0) to ongoing operations (Day 2+), with rail-optimized design capabilities tailored for GPU-intensive workloads. It enables users to manage back-end, front-end, and storage networks via a single interface while allowing flexibility in hardware vendor choices. With Apstra’s closed-loop automation, telemetry from routers and switches optimizes congestion control parameters for peak AI workload performance.

Nokia

Nokia leverages its IP routing and optical transport portfolio to build data center backbones while offering data center switching solutions (recently adding open-source SONiC support) and its Event-Driven Automation (EDA) platform for data center operations. Its 7215 IXS, 7220 IXR, and 7250 IXR series, powered by the SR Linux network OS, alongside the 7750 Service Router as a data center gateway, provide a scalable solution with fixed and modular configurations for deployment flexibility. This portfolio supports large-scale AI Ethernet fabrics, offering up to 460.8 Tb/s of full-duplex capacity and 576 800GbE ports.

Nokia recently secured significant data center contracts, cementing its role in the AI-driven market. In September 2024, it signed a commercial deal with AI infrastructure leader CoreWeave. In November, it extended its five-year agreement with Microsoft Azure to supply data center routers and switches, expanding coverage to over 30 countries and supporting Azure’s 100G-to-400G upgrade. In December 2024, Nokia won a deal to provide IP networking gear for Nscale’s new data center in Stavanger, Norway. This momentum positions Nokia favorably for 2025 growth in AI-related data center builds.

NVIDIA

NVIDIA’s InfiniBand platform centers on the Quantum-X800 series, designed for HPC and “trillion-parameter-scale” AI workloads requiring 800 Gbps throughput. The flagship Q3400-RA switch offers 144 800Gbps ports, supporting two-tier fat-tree topologies to connect over 10,000 endpoints. Its fourth-generation SHARP technology accelerates workloads by offloading compute operations to the network.

For smaller deployments, the Q3200 switch provides two independent 36-port switches in a 2RU space. Both integrate with NVIDIA’s Unified Fabric Manager software and feature dedicated InfiniBand management ports. Pairing Quantum switches with ConnectX-8 SuperNICs and LinkX interconnects, NVIDIA delivers a comprehensive AI networking solution.

In Ethernet environments, NVIDIA’s Spectrum-X platform combines Spectrum-4 switches with BlueField-3 SuperNICs to optimize AI workloads. The SN5600 switch offers 64 800G ports in 2RU with 51.2 Tb/s throughput, implementing RoCE extensions such as adaptive routing and telemetry-driven congestion control for deep learning workloads. Paired with BlueField-3, it provides microsecond-level visibility from GPU to network while handling out-of-order packets. The broader SN5000 series leverages a shared 160MB packet buffer architecture for predictable inter-port latency.


Cloud Service Providers

AWS

At its 2024 re:Invent conference, AWS unveiled its 10p10u network fabric, tailored for AI workloads. It is designed to deliver roughly 10 petabits per second of network capacity at sub-10-microsecond latency within its data centers, incorporating novel hardware solutions:

  • A proprietary trunk connector bundling 16 fiber optic cables.
  • The Firefly optical plug-in system, cutting installation time by 54% through pre-deployment testing.
  • Scalable Intent-Driven Routing (SIDR) protocol, blending centralized planning with distributed execution for sub-second failure response.
  • NeuronLink, a proprietary Scale Up interconnect for Trainium2 servers, enabling low-latency chip-to-chip communication at 1 Tbps in a 2D ring topology (a generic sketch of the ring pattern follows this list).
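
AWS has not published NeuronLink's internals, but the 2D ring (torus) wiring pattern itself is generic: each chip links to four neighbors, with wraparound in both dimensions. The sketch below computes those neighbor relationships for an illustrative grid size; it is not a description of the actual Trainium2 layout.

```python
# Generic 2D ring (torus) neighbor calculation -- an illustrative model of the
# wiring pattern, not AWS's NeuronLink implementation.

ROWS, COLS = 4, 4   # illustrative 16-chip grid

def neighbors(r: int, c: int) -> list:
    """The four neighbors of chip (r, c), with wraparound in both dimensions."""
    return [((r - 1) % ROWS, c), ((r + 1) % ROWS, c),
            (r, (c - 1) % COLS), (r, (c + 1) % COLS)]

print(neighbors(0, 0))  # [(3, 0), (1, 0), (0, 3), (0, 1)] -- edge chips wrap around
# Every chip has exactly four direct links, so hop counts stay bounded and
# ring-based collectives can stream in both dimensions simultaneously.
```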

The 10p10u infrastructure, built on 800 Gbps Ethernet, integrates tightly with AWS’s UltraServer compute technology and Trainium2 AI chips. AWS’s custom Elastic Fabric Adapter (EFA) enhances this with the Scalable Reliable Datagram (SRD) protocol for efficient data transfer, scaling the network from single racks to multi-campus clusters while maintaining performance.

Google Cloud

Google Cloud recently bolstered its network infrastructure with:

  • The Titanium ML network adapter, leveraging NVIDIA’s ConnectX-7 hardware and a four-rail-aligned architecture, powering its A3 Ultra VMs with non-blocking, 3.2 Tbps GPU-to-GPU transfers via RoCE.
  • The sixth-generation TPU, Trillium, offering 4x training and 3x inference throughput over TPU v5e with 67% better energy efficiency. Its high-speed inter-chip interconnect scales up to 256 chips in a single high-bandwidth, low-latency pod, with further scaling via the 13 Pb/s Jupiter data center network.
  • A supercomputing cluster for large-scale AI workload management, enabling dense resource co-location, targeted workload distribution across thousands of accelerators, and advanced maintenance to minimize disruptions. It allows customers to deploy and manage large accelerator pools as a single unit with ultra-low-latency networking.
  • Enhanced cloud interconnect services with application-aware features, optimizing traffic prioritization during congestion and improving bandwidth utilization.

Other Ecosystem Vendors

Marvell

Marvell has solidified its role in data center networking with a broad semiconductor portfolio, especially for AI workloads. Its flagship Teralynx 10 Ethernet switch entered volume production in July 2024, offering 51.2 Tbps capacity with 500ns latency. Its high-radix design boosts infrastructure efficiency, reducing network tiers in large AI clusters by up to 40% for 64K xPU deployments.

Beyond switches, Marvell addresses key connectivity needs. The Structera CXL line tackles memory bandwidth challenges in AI-driven environments, while Alaska P PCIe retimers enable high-speed links between AI accelerators, GPUs, and CPUs. In optics, the new Aquila DSP supports 1.6 Tbps optical transceivers for DCI, optimized for O-band wavelengths, cutting per-link costs compared to traditional C-band solutions. These innovations are validated by strategic partnerships, including a five-year deal with AWS for custom AI products, optical DSPs, and Ethernet switching solutions.

Broadcom

Broadcom maintains dominance in data center networking with its comprehensive switch portfolio, notably the Tomahawk series. Tomahawk 5 marks a major leap:

  • 51.2 Tbps throughput with up to 64 800G ports.
  • Energy efficiency below 1W per 100 Gbps via 5nm process tech.
  • Cognitive routing optimized for AI/ML workload latency.
  • Proven support for hyperscale AI clusters, handling infrastructure with over 1 million xPUs.

JPMorgan estimates Broadcom holds an 80% share of the $5-7 billion data center/AI Ethernet switch market. Beyond Tomahawk, Broadcom is innovating across its product lines. The Trident 5-X12 for enterprise and cloud offers 16 Tbps with 800G uplinks and an on-chip neural network for traffic pattern analysis. The upcoming Tomahawk 6, slated to debut at a 2025 optical networking conference, targets 102.4 Tbps using 3nm process technology.

Astera Labs

Astera Labs, which went public in March 2024, combines high-speed connectivity ICs, modules, and boards with its COSMOS fabric management software. Its portfolio includes:

  • Aries PCIe/CXL Smart DSP Retimers: Address signal integrity in AI and general servers with low power.
  • Aries PCIe/CXL Smart Cable Modules: Enable long-reach copper connections for decoupled cloud/AI infrastructure.
  • Leo CXL Smart Memory Controllers: Support memory expansion and pooling for optimized cloud server memory use.

Conclusion

In 2025, AI workloads, particularly large language models (LLMs) and other foundation models (FMs), are reshaping the data center networking landscape.

The networking industry is innovating across multiple fronts, from new standards and open initiatives like UALink and UEC/UET to the continued evolution of proprietary solutions like NVIDIA’s NVLink. The traditional divide between Scale Up and Scale Out networks is blurring, with rack-scale computing emerging as a new architectural paradigm. Meanwhile, significant strides are being made in optical networking, congestion control, and network automation.

Looking ahead, the pace of innovation shows no signs of slowing. The industry’s ability to meet AI workload demands while addressing practical constraints will be critical to enabling next-generation AI applications and services. Companies that successfully navigate these challenges while remaining adaptable for future growth will hold the strongest positions in this rapidly evolving landscape.