Without question, the biggest bottleneck for artificial intelligence and for a lot of HPC workloads today is bandwidth. Bandwidth at the network level; bandwidth at the socket level; bandwidth at the compute and memory level. No matter how many teraflops a single chip can push at high precision, once your workload scales beyond a single accelerator, node, or rack, bandwidth quickly becomes the limiting factor.
We have seen chipmakers grapple with this on a number of levels: packing more high-bandwidth memory onto their chips, boosting interconnect speeds, and using chiplets to push beyond reticle limits. Intel's "Ponte Vecchio" Max Series GPU and AMD's recently announced "Antares" Instinct MI300X GPU are prime examples of the latter. Driving data between chiplets introduces I/O bottlenecks of its own, but we can't exactly make the dies any bigger.
Even with sockets that sprawl beyond the reticle limit of lithography machines, we still need more capacity to satiate the demands of modern AI and HPC workloads. Over the past few years, we've seen a trend toward denser boxes, racks, and clusters. Cloud providers, hyperscalers, and GPU bit barns are now deploying clusters with tens of thousands of accelerators to keep up with demand for AI applications. It's at this beachhead that silicon photonics startup Lightmatter, now valued at more than $1 billion, believes it has the market cornered.