A Vision to Address the AI Network Wall

Published: March 10, 2026

Hyperscalers, neoclouds, and frontier labs are in an all-out race to seize the advantage in AI. Whether the pursuit is generative AI, agentic AI, physical AI, superintelligence, or other frontiers, the performance and scalability of model training and inference have become the competitive battleground that will determine the winners and losers in this race.

This race is driving trillions of dollars of investment in AI infrastructure. The sad truth, however, is that a significant share of that investment is being wasted. The datacenter network has become a critical bottleneck, throttling AI and preventing those investments from delivering commensurate gains in speed, throughput, and scale.

Modern AI clusters and data centers depend critically on moving massive amounts of data between GPUs at speed and scale. Critical steps in AI processing—gradient all-reduce, model-layer synchronization, key-value cache shuffling, all-to-all operations—require synchronized data movement across all GPUs within racks, across racks, and even across buildings. The slowest link in that data movement stalls the entire operation, leaving GPUs sitting idle while they wait for data. Inefficient networking not only hurts utilization and performance, it also drives up power demand, burning megawatts to push traffic across bloated multi-tier network architectures.
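
To make that dependence on the slowest link concrete, here is a minimal sketch, with purely hypothetical numbers, of how a single congested link sets the step time of a synchronous collective such as gradient all-reduce:

```python
# Minimal sketch: in a synchronous collective (e.g., gradient all-reduce),
# every GPU waits for the slowest link before the training step can finish.
# All numbers here are hypothetical.

def step_time_ms(compute_ms: float, link_times_ms: list[float]) -> float:
    """Step time = compute time + communication time of the slowest link."""
    return compute_ms + max(link_times_ms)

healthy = [5.0] * 9                  # nine links, 5 ms each
one_slow = [5.0] * 8 + [25.0]        # one congested link at 25 ms

print(step_time_ms(40.0, healthy))   # 45.0 ms
print(step_time_ms(40.0, one_slow))  # 65.0 ms -- one slow link stretches every step
```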

The network bottleneck is only getting worse because of continuing advances in AI. The demands on the network increase significantly as models grow to over a trillion parameters, context windows expand to millions of tokens, the use of multi-modal and MoE (mixture-of-experts) models grows, and AI processing steps such as prefill and decode are disaggregated to separate hardware.

The network bottleneck impacts both scale-up and scale-out networks. The size of the scale-up domain is critical to delivering advances in AI model performance, yet maximum scale-up domain sizes are advancing slowly, currently limited to just 72 GPUs, in large part because of what networking solutions can support in a single low-latency tier. Scale-out domains are similarly handicapped by the network: as they grow toward hundreds of thousands and even over a million GPUs, performance, latency, and reliability all suffer as more network tiers, switches, and optics are added.

A Reengineering of Networking is Required

Compute performance has been increasing at a near-exponential rate: GPUs have gone from approximately 4 PFLOPS in 2022 to more than 20 PFLOPS today, and roadmaps foresee packages delivering hundreds of PFLOPS in just a few years. Networking, however, has been stuck at a plodding pace of incremental improvement, doubling in capacity every 24 to 36 months. Over the last four years, network switches have advanced only from 25.6 Tbps to 51.2 Tbps, with 102.4 Tbps emerging just this year (“The Widening Chasm: XPU Data Hunger Outpaces Network Infrastructure”, 650 Group).
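
As a quick back-of-the-envelope comparison using the figures cited above, treating both spans as roughly four years (the exact endpoints are approximations), the two growth rates diverge sharply:

```python
# Rough growth-rate comparison using the figures cited above; the time
# spans are approximate, so treat the results as order-of-magnitude only.

def annual_growth(start: float, end: float, years: float) -> float:
    """Compound annual growth rate."""
    return (end / start) ** (1 / years) - 1

gpu_growth = annual_growth(4, 20, 4)          # ~4 PFLOPS (2022) -> 20+ PFLOPS today
switch_growth = annual_growth(25.6, 51.2, 4)  # 25.6 Tbps -> 51.2 Tbps over ~4 years

print(f"GPU compute:     ~{gpu_growth:.0%} per year")     # roughly 50% per year
print(f"Switch capacity: ~{switch_growth:.0%} per year")  # roughly 19% per year
```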

It’s clear that the gap between what AI needs from networking and what network infrastructure is able to support can’t be closed with just further evolution of existing technology, or even with new technology that offers only a modest improvement. Simply adding more switches and optics along the current architectural path only increases latency, power draw, cooling requirements and overall inefficiency.

A Vision for a New Generation of AI Networking

The next generation of networking for AI will be defined by the following:

  • An order of magnitude leap. Incremental improvements are insufficient not only to close the gap between AI demands and what today’s networking technology delivers, but also to justify transitioning to a new generation of datacenter networking. An order of magnitude improvement in radix, throughput, and scalability is required to support ever-growing AI demands.
  • A clean-sheet design spanning from silicon to systems and software. Delivering that leap requires a new architecture, combined with the latest advances across multiple domains. Only a holistic design that unifies silicon, optics, packaging, and system architecture can deliver the step function improvement required, by coordinating design and implementation decisions throughout the system to ensure that every component operates at maximum efficiency and performance.
  • The ability to support scale-out domains of hundreds of thousands of GPUs and scale-up domains of thousands of GPUs. As AI clusters and datacenters grow ever larger, supporting large domains ensures low and consistent network latency, avoiding the jitter and high tail latencies that crater performance for AI workloads.

Making an Impact Across the AI Datacenter

Solving the network bottleneck isn’t just an optimization that makes the network slightly faster or slightly more efficient. A true order of magnitude advance impacts the entire AI data center.

For scale-up, an order of magnitude increase in radix and throughput unlocks significantly larger and more powerful domains. Thousands of GPUs in a scale-up domain become possible because they can be supported by a single network tier, ensuring the consistent low latency required for training large, advanced models.

For scale-out, that order of magnitude increase enables the dramatically larger scale-out domains needed to support large-scale AI datacenters. Rather than requiring multiple layers of spine switches, a two-tier network can support over a million current-generation GPUs.
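
As a rough sizing sketch (the radix values below are illustrative assumptions, not product specifications), the endpoint count of a two-tier leaf-spine fabric grows with the square of the switch radix, which is why a large jump in radix translates directly into million-GPU scale-out domains:

```python
# Rough sizing of a non-blocking two-tier (leaf-spine) Clos fabric.
# Each leaf splits its ports evenly between GPUs and spine uplinks, and
# each spine connects once to every leaf. Radix values are illustrative.

def two_tier_endpoints(radix: int) -> int:
    """Max GPU endpoints = (number of leaves) x (GPU-facing ports per leaf)."""
    leaves = radix                   # limited by the spine's port count
    gpu_ports_per_leaf = radix // 2
    return leaves * gpu_ports_per_leaf

print(two_tier_endpoints(512))    # 131,072 GPUs with a radix-512 switch
print(two_tier_endpoints(5_120))  # ~13.1 million GPUs with a 10x larger radix
```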

Flattening the back-end network not only simplifies the network and enables larger scale-up and scale-out domains, it also improves the economics, efficiency, and productivity of the AI data center. For example, it can translate into a 90% reduction in the number of switches required, because a single switch can replace 30 of today’s switches, reducing CapEx by 40% and network power consumption by 70%.

“10x increase in radix can replace 30 switches with one Eridu switch”

Reducing the number of switches also reduces the number of components—cables, transceivers, and more—which further decreases costs and increases reliability. Those improvements quickly add up: in a 200K GPU cluster, every 10% increase in GPU utilization translates into over $1B of cost reduction.
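
As an illustrative calculation behind a figure like that (the all-in cost per GPU below is an assumed round number, not a quoted figure), idle GPU capacity can be treated as stranded capital on the cluster's deployed cost:

```python
# Illustrative arithmetic only: the all-in cost per deployed GPU is an
# assumed round number, not a quoted figure.

gpus = 200_000
all_in_cost_per_gpu = 50_000   # USD, assumed (GPU plus its share of infrastructure)

cluster_cost = gpus * all_in_cost_per_gpu          # $10B of deployed capacity
value_per_10pct_utilization = 0.10 * cluster_cost  # ~$1B of effective capacity

print(f"Cluster cost:              ${cluster_cost / 1e9:.0f}B")
print(f"Each 10% utilization gain: ${value_per_10pct_utilization / 1e9:.0f}B")
```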

Delivering on the Vision

There are no shortcuts to reinvention. Incumbents are trapped by their past success, locked into slow, incremental evolution of their existing technologies and architectures, while newcomers who take shortcuts toward modest improvements do not deliver the value needed to justify transitioning to new technology.

We founded Eridu to take on the challenge. We’ve built a team that brings deep expertise across networking ASICs, advanced packaging, optical technologies, systems engineering, and software. Collectively, we have led dozens of tape-outs, created extensive patent portfolios across hardware and software, and delivered multiple generations of high-performance networking products at leading technology companies.

We’ve been working intensely to solve some of the hardest challenges in AI infrastructure. From early days on whiteboards, to refining our designs through simulation, to full-on development, we’re excited about how far we’ve come and about bringing a groundbreaking solution to market.

The future demands a reimagined network. We’re excited to be bringing that future to reality.