Connecting the Dots: Why AMD Is the Only Company That Doesn't Need an Acquisition for the SRAM Inference Revolution
Every AI chip on Earth has the same problem. And two companies just spent billions admitting they can’t solve it with GPUs alone.
The problem is called the memory wall. And the solution might already be sitting inside AMD — in a division most investors have forgotten exists.
The Problem: Why GPUs Are Wrong for Half the Job
When you ask an AI model a question, two very different things happen inside the hardware.
Phase 1: Prefill. The model reads your entire prompt in one massive parallel burst. This is a matrix multiplication party — thousands of compute cores crunching numbers simultaneously. GPUs are perfect for this. More cores, more HBM bandwidth, more speed. This is what NVIDIA’s GB300 and AMD’s MI455X are built to do.
Phase 2: Decode. The model generates its response one token at a time. Each token depends on the previous one. You can’t parallelize this — it’s inherently sequential. And here’s the problem: during decode, the GPU’s compute cores sit mostly idle, waiting for data from memory.
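The two phases can be sketched in a few lines. This is a toy NumPy stand-in for the real kernels (the dimensions, the `tanh` step, and the weights are all illustrative), but it shows the structural difference: prefill is one batched operation, while decode is a loop where each step consumes the previous step's output and therefore cannot be parallelized across steps.

```python
# Toy illustration of the two inference phases (NumPy stand-in for real
# kernels; sizes and the "next-token" function are illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # toy hidden size
W = rng.standard_normal((d, d))          # "model weights"

# Phase 1, prefill: the whole prompt (many tokens) in one batched matmul.
prompt = rng.standard_normal((512, d))   # 512 prompt tokens at once
prefill_out = prompt @ W                 # one big parallel operation

# Phase 2, decode: one token at a time. Each step depends on the previous
# token, so the loop is inherently sequential.
tok = prefill_out[-1]
generated = []
for _ in range(8):
    tok = np.tanh(tok @ W)               # toy step; depends on previous tok
    generated.append(tok)

print(prefill_out.shape, len(generated))
```

The prefill matmul keeps thousands of cores busy at once; the decode loop issues one small matrix-vector product per step, which is where the idle-core problem below comes from.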
During decode, a GPU is like a Formula 1 engine stuck in a school zone. You have thousands of compute cores — but you can only use a fraction of them because you’re bottlenecked by how fast you can read weights from memory.
This is the memory wall. And it means every GPU running inference is overprovisioned for compute and starved for bandwidth during the phase that matters most for user-facing latency.
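The arithmetic behind the wall is easy to sketch. Each decode step must stream essentially all of the active model weights from memory once, so memory bandwidth sets a hard ceiling on sequential tokens per second, regardless of how many teraflops the chip has. A back-of-envelope model (the 19.6 TB/s figure is the MI455X bandwidth quoted later in this piece; the 70 GB weight footprint is an illustrative assumption for a 70B-parameter model at 8-bit weights):

```python
# Back-of-envelope decode ceiling: every generated token requires streaming
# the active model weights from memory once, so
#   max tokens/sec per sequence ~= memory bandwidth / bytes of weights.
# Illustrative numbers, not vendor benchmarks.

def decode_ceiling(bandwidth_bytes_per_s: float, weight_bytes: float) -> float:
    """Upper bound on sequential tokens/sec, ignoring compute entirely."""
    return bandwidth_bytes_per_s / weight_bytes

weights = 70e9    # ~70 GB: a 70B-parameter model at 8-bit weights (assumed)
hbm_bw = 19.6e12  # 19.6 TB/s HBM4 bandwidth

print(f"~{decode_ceiling(hbm_bw, weights):.0f} tokens/sec ceiling per sequence")
```

At that ceiling the compute cores finish each token's math long before the next weight read arrives, which is exactly the Formula 1 engine in a school zone.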
The Solution: Don’t Move the Data
The traditional approach: compute lives in the processor, data lives in memory, and you shuttle data back and forth across a bus. Faster bus = less wall. That’s HBM — High Bandwidth Memory — and it’s why NVIDIA has locked up $95 billion in memory supply commitments.
But there’s a fundamentally different approach: what if you never move the data at all?
SRAM Compute-In-Memory (CIM) performs multiply-accumulate operations directly inside the memory array. The data doesn’t travel to the processor — the processor comes to the data. No bus. No latency. No memory wall.
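Conceptually, CIM means the weight matrix stays put and the result accumulates where the weights live. A minimal sketch of the idea (a software model only; real CIM does this with analog or digital circuits on the bitlines, not NumPy):

```python
# Conceptual model of SRAM CIM: weights stay in the array and are never
# read out; the input vector is broadcast row by row, and each column
# accumulates its own dot product in place.
import numpy as np

W = np.array([[1, 0, 1],
              [0, 1, 1]])        # "stored in the SRAM array", never moved
x = np.array([3, 5])             # activations broadcast across rows

acc = np.zeros(3, dtype=int)
for i in range(len(x)):          # row-by-row broadcast, like wordlines firing
    acc += x[i] * W[i]           # each column accumulates locally

print(acc)                       # equals x @ W without ever reading W out
```

The result is identical to `x @ W`, but in hardware the weights never cross a bus, which is where the latency and power savings come from.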
| Metric | SRAM / CIM | HBM | Delta |
|---|---|---|---|
| Latency | Sub-nanosecond | ~10-20 ns | SRAM 10-100x faster |
| Bandwidth per watt | Very high | High but power-hungry | SRAM 4x+ advantage |
| Capacity | 4-32 MB on-chip | 288-432 GB | HBM 10,000x+ more |
| Cost per GB | Very expensive | Expensive but scalable | HBM 100x+ cheaper |
| Best for | Decode (sequential tokens) | Prefill (parallel prompt) | Different jobs |
| Data movement | Zero (compute in place) | GPU ↔ memory bus | Eliminated by CIM |
SRAM is 10-100x faster than external DRAM for access latency. It consumes far less power per operation because there’s zero data movement. The trade-off: it’s tiny. You can fit 4-32 megabytes on-chip, versus 288-432 gigabytes of HBM.
That trade-off doesn’t matter for decode. During token generation, you’re reading the same weight matrices over and over, one token at a time. A well-designed SRAM CIM chip can keep the hot weights on-chip and generate tokens at maximum speed without ever touching external memory.
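How tight is that on-chip budget? A quick sizing check makes it concrete (the `d_model` value and the single-FFN-matrix framing are illustrative assumptions, not a statement about any specific product):

```python
# What fits in 4-32 MB of on-chip SRAM? Illustrative sizing only.
MB = 1 << 20

def layer_bytes(d_model: int, bits: int) -> int:
    """Approx weight bytes for one transformer FFN matrix (d_model x 4*d_model)."""
    return d_model * 4 * d_model * bits // 8

# A 7B-class model typically has d_model around 4096 (assumed here).
ffn = layer_bytes(4096, 4)  # 4-bit quantized weights
print(f"{ffn // MB} MB for one FFN matrix at 4-bit")  # prints "32 MB ..."
```

A single 4-bit FFN matrix already saturates the top of the 4-32 MB range, which is one reason SRAM-first designs like Groq's spread a model across many chips rather than trying to fit it on one.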
The Deals That Prove the Split
Three deals in early 2026 show this isn’t theory — it’s the architectural direction of the industry.
AWS + Cerebras: Renting Speed
AWS announced it will offer Cerebras inference on Amazon Bedrock, splitting the pipeline: Trainium3 handles prefill (compute-heavy), Cerebras CS-3 handles decode (SRAM-based token generation). AWS doesn’t own the decode hardware — it rents it from Cerebras.
NVIDIA + Groq: Buying the Architecture
NVIDIA is acquiring Groq — the company behind the LPU (Language Processing Unit), an inference chip designed entirely around on-chip SRAM. No HBM. Pure SRAM-based decode.
Post-acquisition, Groq boosted Samsung 4nm wafer orders by 70% — from 9,000 to 15,000 wafers. NVIDIA is scaling this for real deployment, not just acqui-hiring talent.
The irony: NVIDIA locked up $95.2 billion in supply commitments heavily weighted toward HBM. Now its own acquisition suggests HBM may become less central to inference. NVIDIA is developing a new chip leveraging Groq’s SRAM architecture — which could reduce NVIDIA’s own HBM demand.
Gimlet: The New Category
Gimlet is building an “agent-native inference cloud” that disaggregates workloads across GPUs, SRAM-centric chips, other accelerators, and CPUs. Different hardware for different phases. This is the clearest signal that disaggregated inference is becoming a product category, not just a research concept.
Every major player is moving toward the same conclusion: one chip can’t be optimal for both prefill and decode. The stack is splitting. The question is who owns both halves.
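The disaggregated pattern the deals above converge on can be sketched as a phase-aware router. This is a hypothetical scheduler skeleton, not any vendor's actual API; the pool names and request shape are invented for illustration:

```python
# Minimal sketch of a disaggregated inference router (hypothetical API;
# pool names and request fields are illustrative, not a real scheduler).
from dataclasses import dataclass, field

@dataclass
class Pool:
    name: str
    queue: list = field(default_factory=list)

gpu_pool = Pool("gpu-prefill")    # compute-heavy: batched prompt processing
sram_pool = Pool("sram-decode")   # bandwidth-bound: sequential token generation

def route(request: dict) -> Pool:
    """Send each phase of a request to the hardware shaped for it."""
    pool = gpu_pool if request["phase"] == "prefill" else sram_pool
    pool.queue.append(request["id"])
    return pool

route({"id": "r1", "phase": "prefill"})
route({"id": "r1", "phase": "decode"})   # same request, second phase
print(gpu_pool.queue, sram_pool.queue)
```

The same request touches both pools: its prompt is processed on compute-optimized hardware, then its token generation runs on SRAM-based hardware. That handoff is the architecture AWS rents, NVIDIA bought, and Gimlet is productizing.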
The Dot Nobody Has Connected: AMD Already Has Everything
Here’s the part that should make investors sit up.
NVIDIA had to acquire Groq to get SRAM expertise. AWS has to rent from Cerebras. Google runs monolithic TPUs with no SRAM decode play. None of them had SRAM Compute-In-Memory IP in-house.
AMD does. They’ve had it since 2022.
When AMD acquired Xilinx for $49 billion, most analysts focused on the FPGA business — programmable chips for telecom, automotive, and defense. What they missed: Xilinx had spent over a decade developing the most advanced SRAM-based compute architecture in the industry.
| | GPU (Prefill) | SRAM/CIM (Decode) | NPU (Edge) | CPU (Orchestration) |
|---|---|---|---|---|
| AMD (already owned: Xilinx, $49B, 2022) | MI455X: 432GB HBM4, 19.6 TB/s | Versal CIM: 4MB SRAM, 4x perf/watt vs GPU | XDNA2: 3.8x more efficient than GPU | EPYC: server CPUs sold out for 2026 |
| NVIDIA (had to acquire Groq) | GB300 / Rubin: 288GB HBM3E/HBM4 | Groq LPU: on-chip SRAM, acquired 2026 | — | Grace: ARM server CPU |
| AWS (rents from Cerebras) | — (uses NVIDIA GPUs) | Cerebras CS-3: rented, not owned | Inferentia: inference ASIC | Graviton: ARM server CPU |
| Google (monolithic TPU, no disaggregation) | — (uses own TPUs) | — (no SRAM CIM play) | TPU v7: monolithic, own interconnect | Axion: ARM server CPU |
What AMD owns through Xilinx:
Versal AI Edge — An adaptive computing platform with 4MB of on-chip SRAM that delivers 4x AI performance per watt compared to leading GPUs. The SRAM provides deterministic, low-latency access that’s 10-100x faster than external DRAM. The architecture eliminates cache misses entirely — FPGA fabric attaches directly to compute cores without cache, delivering data every clock cycle. AMD calls this solving the “dark silicon” problem: no idle processing elements waiting for memory.
Versal AI Core — The datacenter variant, with the VC1902 delivering 133 INT8 TOPS, scaling to 405 INT4 TOPS across the portfolio.
FPGA LLM inference — On the Alveo V80 FPGA, Llama2-7B runs at 65.8 tokens/sec. On the VHK158, it hits 333 tokens/sec decode — competitive with dedicated inference ASICs.
FINN ML framework — A quantized neural network compiler specifically designed for FPGA inference, developed by AMD’s Integrated Communications and AI Lab.
What AMD owns alongside Xilinx:
- MI455X Helios — 432GB HBM4, 19.6 TB/s bandwidth. The GPU side of the stack, shipping H2 2026.
- XDNA2 NPU — 3.8x more power-efficient than GPU for inference. Built into every Ryzen AI processor.
- EPYC — Server CPUs sold out for the entirety of 2026. The orchestration layer for agentic AI.
No other company on Earth owns all four compute modalities: GPU + FPGA/CIM + NPU + CPU.
NVIDIA has GPUs and Grace CPUs — but had to buy Groq for SRAM. AWS has Trainium and Graviton — but rents SRAM from Cerebras and doesn’t make GPUs. Google has TPUs and Axion CPUs — but runs monolithic inference with no SRAM decode play.
The Question the Market Should Be Asking
If AMD already owns the deepest SRAM CIM IP in the industry, the most power-efficient NPU architecture, a competitive datacenter GPU, and the server CPU that’s sold out everywhere — why hasn’t anyone connected these dots?
There are a few possible explanations:
Organizational silos. The Adaptive Computing division (ex-Xilinx) operates semi-independently from the datacenter GPU team. Versal is positioned for edge and embedded, not datacenter inference. The teams may not be talking to each other about disaggregated inference.
Strategic patience. AMD may be deliberately waiting. Let NVIDIA spend billions acquiring Groq and ramping Samsung wafers. Let AWS validate the disaggregated architecture with Cerebras. Then enter with a vertically integrated solution that doesn’t depend on any third party.
The scale gap. Versal’s 4MB SRAM is impressive for edge. But datacenter-scale LLM decode needs more. Scaling SRAM CIM from embedded to datacenter is a hard engineering problem — harder than buying Groq. AMD may not have the datacenter-scale CIM silicon ready yet.
Whatever the reason, the asymmetry is striking. The entire industry is racing to split inference into prefill and decode, spending billions on acquisitions and partnerships to get SRAM expertise. And AMD — the company everyone thinks of as “the GPU alternative” — has been sitting on the most complete hardware portfolio for disaggregated inference since 2022.
They just haven’t told anyone yet.