SGNL Intelligence.

The $17 Million Memory Bill: Inside the AI Rack's Most Expensive Secret

Tags: Memory · HBM4 · CXL · SRAM · Groq · NVIDIA · Inference · Kioxia · Marvell · SK Hynix

Everyone talks about GPUs. Nobody talks about the memory bill.

A single NVIDIA Vera Rubin NVL72 rack contains 72 GPUs, each with 288 GB of HBM4. That’s 20.7 terabytes of the fastest, most expensive memory ever manufactured — and the HBM4 memory component alone accounts for an estimated $10-16 million of each rack’s cost.

But here’s the part that will make your CFO cry: 20 TB isn’t enough. When your model runs 1,000 concurrent users with million-token context windows, the KV cache alone needs 100+ TB. Where does it go?
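The KV cache arithmetic is easy to sanity-check. The sketch below uses a hypothetical dense model (64 layers, 8 grouped-query KV heads of dimension 128, an FP8 cache); none of these parameters come from the article, and real models will differ.

```python
# Back-of-envelope KV cache sizing for long-context inference.
# All model parameters below are hypothetical, for illustration only.

def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_elem: int) -> int:
    """Keys + values: 2 tensors per layer, each kv_heads * head_dim wide."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token(layers=64, kv_heads=8, head_dim=128,
                                     bytes_per_elem=1)  # FP8 cache
per_user = per_token * 1_000_000   # million-token context window
fleet = per_user * 1_000           # 1,000 concurrent users

print(f"{per_token / 1024:.0f} KiB/token")     # 128 KiB/token
print(f"{per_user / 1e9:.0f} GB per user")     # ~131 GB
print(f"{fleet / 1e12:.0f} TB for the fleet")  # ~131 TB vs 20.7 TB of HBM
```

Even with an aggressive FP8 cache and grouped-query attention, a thousand million-token sessions land in the 100+ TB range the article cites.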

The answer: the AI industry is building an entirely new memory hierarchy — five tiers deep — and every single layer has a new product shipping in 2026.


The Five-Tier Memory Stack

The 2026 AI Memory Hierarchy

  • SRAM (Groq LPU): embedded (no $/GB), on-chip bandwidth. Capacity: ~60 GB/rack. Latency: <1 ns. Decode-only. Eliminates the HBM bottleneck for token generation. 25% of Jensen's data center recipe.
  • HBM4: $500-800/GB, ~20 TB/s. Capacity: 20.7 TB/rack. Latency: ~10-20 ns. Compute layer. 288 GB per Vera Rubin GPU × 72 = 20.7 TB/rack. Fastest, scarcest, 90%+ of memory cost.
  • CXL Memory: $5-10/GB, ~50-128 GB/s. Capacity: 50-100 TB/rack. Latency: ~100-200 ns. KV cache overflow. Marvell 260-lane CXL switch + Penguin 11 TB server shipping now. 10x faster than NVMe.
  • HBF (SK Hynix): $/GB, bandwidth, capacity, and latency all TBD. Bridge layer. Memory vendor approach to the HBM-SSD gap. Complementary or competitive with CXL — TBD.
  • GPU-Direct Flash: $0.10-0.50/GB, ~7-50 GB/s. Capacity: 100+ TB/rack. Latency: ~10-100 µs. Cold KV cache, checkpoints, model weights. Kioxia GP Series: 512-byte GPU-direct access. NVIDIA Storage-Next.
This isn’t theoretical. Every tier in this chart has hardware shipping or sampling right now. Let’s walk through each one.


Tier 1: SRAM — The Speed Demon

Jensen’s GTC architecture splits inference into two phases: prefill (understanding your prompt) and decode (generating the response). Prefill is compute-bound — it needs GPUs. Decode is memory-bandwidth-bound — it needs raw speed.

Enter Groq. Their LPU (Language Processing Unit) keeps everything in on-chip SRAM — no off-chip memory access at all. The result: decode at wire speed, sub-nanosecond latency. Jensen told the GTC audience he’d put “25% Groq, 75% Vera Rubin” in every data center.

The Groq3 LPU is being manufactured by Samsung Foundry on 4nm, with shipments to NVIDIA starting Q3 2026. A Groq rack holds ~60 GB of SRAM — tiny capacity, but at speeds no other memory can match.

SRAM for decode speed. HBM for compute bandwidth. Two different memories for two different phases of a single inference request.

Tier 2: HBM4 — The Compute Engine

HBM4 is the workhorse. Vera Rubin packs 288 GB per GPU at 20.5 TB/s bandwidth. Multiply by 72 GPUs in an NVL72 rack and you get 20.7 TB at 1,580 TB/s aggregate — nearly 3x the bandwidth of last-gen GB300.

All three memory vendors (Samsung, SK Hynix, Micron) are now qualified. Samsung is tripling HBM output with half focused on HBM4. SK Hynix showed 16-layer HBM4 at GTC. Samsung unveiled HBM4E specs: 4 TB/s per stack, 16 Gbps, 48 GB capacity — the next generation targeting Rubin Ultra.

But HBM has a fundamental problem: it’s too expensive to scale for capacity. At $500-800 per GB, storing 100 TB of KV cache in HBM would cost $50-80 million per rack. Nobody is doing that.

This is why every tier below HBM exists.


Tier 3: CXL Memory — The Capacity Play

Between HBM’s $500/GB and NVMe’s $0.10/GB, there was nothing. CXL fills that gap.

CXL (Compute Express Link) lets you plug terabytes of regular DDR5 memory into a server and have GPUs access it coherently — 10x faster than NVMe, at $5-10 per GB. For KV cache overflow from long-context models, this is the sweet spot.

Three things just made CXL real:

1. The CPU gate opened. CXL rides on PCIe — you need a CXL-capable CPU. Both AMD EPYC Turin and Intel Xeon 6 now ship CXL 2.0. Intel Xeon 6 is the CPU inside NVIDIA DGX Rubin NVL8 — CXL memory pooling is built into NVIDIA’s flagship from the host CPU level.

2. The switch arrived. Marvell just announced the Structera S 30260 — a 260-lane CXL switch that enables rack-level memory pooling. This is the first concrete CXL switching product. It works with Structera A near-memory accelerators, Structera X expansion controllers, and Alaska P retimers — a complete CXL fabric.

3. The server exists. Penguin Solutions’ MemoryAI is the first production-ready CXL KV cache server: 11 TB of CXL memory, compatible with NVIDIA Dynamo 1.0. Research (Beluga) showed 7.35x throughput vs RDMA and 89.6% TTFT reduction.

Google is already deploying CXL controllers in production. For a Vera Rubin NVL72 rack, we estimate 50-100 TB of CXL memory needed for KV cache overflow — at $250K-1M vs $25-50M for the same capacity in HBM.
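The CXL-versus-HBM cost gap is worth writing out. This sketch uses the article's own price estimates ($5-10/GB for CXL DDR5, roughly $500/GB at the low end of HBM4), not vendor quotes:

```python
# Cost of 50-100 TB of KV-cache overflow: CXL DDR5 pool vs the same
# capacity in HBM4. Prices are the article's estimates, not quotes.
TB = 1_000  # GB per TB (decimal)

cxl_low, cxl_high = 50 * TB * 5,   100 * TB * 10   # $5-10/GB
hbm_low, hbm_high = 50 * TB * 500, 100 * TB * 500  # ~$500/GB (HBM4 low end)

print(f"CXL: ${cxl_low/1e3:.0f}K - ${cxl_high/1e6:.0f}M")  # $250K - $1M
print(f"HBM: ${hbm_low/1e6:.0f}M - ${hbm_high/1e6:.0f}M")  # $25M - $50M
print(f"CXL is ~{hbm_low // cxl_low}x cheaper per GB")      # ~100x
```

Two orders of magnitude per gigabyte is the entire business case for the middle tier.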

CXL Version Progression — Why 2.0 Matters

  • CXL 1.1 (2022, PCIe 5.0): host → device only. CPU can read/write device memory. No pooling. Shipped with Intel Sapphire Rapids, AMD Genoa.
  • CXL 2.0 (2023-24, PCIe 5.0): memory pooling + switching, GPU-direct. Multiple hosts share a CXL memory pool via switches. AMD Turin + Intel Xeon 6 ship this.
  • CXL 3.0 (2025, PCIe 6.0): fabric + peer-to-peer, GPU-direct. Multi-level switching, fabric topologies, coherent P2P across hosts. Full rack-scale memory fabric.
  • CXL 4.0 (Nov 2025, PCIe 7.0): 128 GT/s + bundled ports, GPU-direct. Doubles bandwidth. Bundled ports for 1.5 TB/s connections. Multi-rack memory pooling at 100+ TB scale.

CXL 2.0 is the critical threshold. That’s where GPU-direct memory access via switches begins. CXL 3.0 (fabric topologies, peer-to-peer) enables multi-rack memory pools. CXL 4.0 (PCIe 7.0, 128 GT/s) doubles bandwidth again. The hardware is here; the ecosystem is catching up.


Tier 3.5: HBF — The Memory Vendor’s Bridge

SK Hynix is building HBF (High Bandwidth Flash) — a new memory product that bridges HBM and SSD from the vendor side. Think of it as CXL's cousin: same niche (between HBM and flash), different approach (a memory product rather than an interconnect standard).

Whether HBF and CXL compete or complement each other is one of the most interesting open questions in memory architecture. Both validate the same insight: the industry knows it needs a middle tier.


The Cannibalization Problem

Before we get to the flash tier, let’s zoom out. Because the five-tier stack only makes sense once you understand what’s happening to the global memory supply.

Silicon Wafer Consumption per GB of Memory

  • Standard DDR5: 1x wafer area per GB (baseline)
  • GDDR7 (GPU memory): 1.7x wafer area per GB vs DDR5
  • HBM4 (AI accelerator): 3-4x wafer area per GB; each HBM chip cannibalizes 3-4 DDR5 chips

Source: TrendForce. Every 1 GB of HBM4 produced consumes the silicon that could have made 3-4 GB of DDR5 for laptops and phones.

Here’s the number that explains the entire memory crisis: HBM consumes 3-4x more silicon wafer per gigabyte than standard DDR5. Every single HBM4 chip manufactured for a Vera Rubin GPU cannibalizes the silicon that could have produced 3-4 laptop or smartphone memory chips.

The global DRAM wafer capacity is approximately 2 million 300mm wafer starts per month — about 22% of all semiconductor fab capacity. In 2026, AI workloads (HBM + GDDR7) will consume 20% of all DRAM wafer capacity, according to TrendForce. And that’s just AI. Data centers overall will consume 70% of all memory chips produced in 2026.

Where DRAM Goes: Server/DC vs Mobile vs PC

  • 2024: Server/DC 38%, Mobile 35%, PC 22%
  • 2026E: Server/DC 48%, Mobile 28%, PC 18%

Server/DC share growing from 38% to 48% of DRAM bits shipped. Mobile and PC shrinking. AI is cannibalizing consumer memory.

The shift is dramatic. In 2024, servers took 38% of DRAM bits. By 2026, it’s 48% — while mobile drops from 35% to 28% and PCs from 22% to 18%. AI is literally eating the memory that would have gone into your phone and laptop.

And it gets worse. OpenAI’s Stargate project alone — just one customer — could consume up to 40% of global DRAM output, with reported deals for up to 900,000 wafers per month from Samsung and SK Hynix.

One project. 40% of global DRAM output. This is why SK Group’s chairman says wafer shortages persist through 2030.

The Memory Pricing Supercycle

The supply-demand imbalance has created the most extreme memory pricing in a decade:

  • DRAM: Pricing in early 2026 is 7-8x higher than the same period in 2025. No price relief expected this year.
  • NAND: Samsung raised prices 100% in Q1 2026 and plans another 100% in Q2. TrendForce projected a 90% QoQ surge. Legacy NAND ASPs are “getting crazier.”
  • Samsung’s paradox: Their own memory business is so profitable that the high prices are crushing their smartphone division — operating profit down 60% YoY. Samsung is cutting Galaxy Z Tri-Fold sales because memory costs make the phones unprofitable.
  • Phison (NAND controller maker) moved to a prepayment model — customers must pay upfront before any supply is allocated. That’s how tight the market is.
  • Server quotes now expire in days, not weeks, as OEMs scramble to lock in prices before the next hike.
  • Gartner predicts the entry-level PC segment will disappear by 2028 because memory costs make budget laptops uneconomical.

Samsung’s impossible position captured in one sentence: their memory division is tripling HBM output and raising NAND prices 100%, which is so successful that it’s destroying their smartphone division’s profitability. The company that makes the memory is being killed by the price of its own memory.


The China Wildcard

There’s a supply variable that most Western analysts undercount: China is building its own memory industry at breakneck speed.

CXMT (ChangXin Memory Technologies) scaled DRAM production from 100,000 to 200,000 wafer starts per month in 2024 — and is targeting 300,000 WSPM by 2026. That’s roughly 13-15% of the global DRAM wafer base, from a company most Western investors have never heard of. Lenovo is already adopting CXMT’s LPDDR5X modules. Chinese domestic firms are shifting procurement toward CXMT at scale.

YMTC (Yangtze Memory Technologies) is mass-producing 232-layer and 294-layer Xtacking 4.0 NAND, targeting 15% of global NAND market share by 2026. And here’s the pivot: YMTC’s third Wuhan fab, coming online in 2027, will dedicate 50% of its capacity to DRAM production — diversifying from NAND into the memory type where China has the most to gain.

Together, China’s DRAM market share is projected to reach 10-11% by 2027. That’s not dominant, but it’s enough to reshape the supply equation — especially for non-AI memory, where CXMT’s DDR5 and LPDDR5X compete directly with Samsung, SK Hynix, and Micron’s consumer-grade output.

The strategic risk profile by memory type is telling:

  • HBM: Lowest China risk — requires advanced packaging (TSVs, CoWoS) that China can’t replicate
  • DRAM: Medium risk — CXMT is real competition for commodity DDR5/LPDDR
  • NAND: Highest China risk — YMTC at 15% share and growing, with competitive layer counts

This creates a fascinating dynamic for the AI memory stack. HBM — the tier that matters most for AI — is the one where China is least competitive. But in standard DRAM and NAND — the tiers that feed CXL memory pools and GPU-direct flash — China is a growing force. The memory shortage that drives the five-tier stack is partly a Western problem that China’s domestic production could partially relieve, at least for the lower tiers.


Tier 4: GPU-Direct Flash — The Deep Pool

NVIDIA’s Storage-Next initiative creates a brand new memory tier: flash storage that GPUs can access directly, bypassing the CPU entirely.

Kioxia’s GP Series is the first product: a Super High IOPS SSD using XL-FLASH Storage Class Memory, with 512-byte granularity (vs 4KB for standard SSDs). This enables GPUs to read fine-grained data from flash — like pulling a specific KV cache entry — without the overhead of traditional block I/O.
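The point of 512-byte granularity is read amplification. A toy calculation (the KV entry size is an assumed example, not a Kioxia specification) shows why 4 KB blocks hurt fine-grained GPU reads:

```python
# Read amplification when fetching small KV-cache entries from flash.
# The 512-byte entry size is a hypothetical example, not a Kioxia spec.

def amplification(io_granularity: int, entry_bytes: int) -> float:
    """Bytes read from flash per useful byte delivered."""
    blocks = -(-entry_bytes // io_granularity)  # ceiling division
    return blocks * io_granularity / entry_bytes

entry = 512  # bytes actually needed per KV-cache fetch
print(amplification(4096, entry))  # 8.0: a 4 KiB-block SSD reads 8x the data
print(amplification(512, entry))   # 1.0: 512-byte granularity wastes nothing
```

For workloads dominated by small random reads, that 8x waste in bandwidth and IOPS is the difference flash-tier KV caching needs to close.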

Kioxia also announced the CM9 Series: 25.6 TB PCIe 5.0 SSDs designed specifically for Context Memory Storage (CMX) — NVIDIA’s architecture for KV cache persistence. When a user’s conversation session goes cold, the KV cache moves from CXL to flash. When they come back, it’s reloaded.

This tier handles: checkpoints, cold KV cache, model weight overflow, training data staging. At $0.10-0.50/GB, you can store petabytes affordably.

NVIDIA Dynamo 1.0 — the inference operating system that just entered production with 7x Blackwell boost — orchestrates data movement across all of these tiers.
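Conceptually, the orchestration job is a demotion policy across tiers. Dynamo's actual interfaces aren't described in this article; the sketch below is a toy policy with invented names and thresholds, keeping hot sessions in HBM, warm ones in the CXL pool, and cold ones on flash:

```python
# Toy KV-cache tiering policy: place sessions by idle time.
# Tier names mirror the article's stack; thresholds are invented.

def tier_for(idle_seconds: float) -> str:
    """Pick a memory tier from how long the session has been idle."""
    if idle_seconds < 30:     # actively decoding: keep next to compute
        return "HBM"
    if idle_seconds < 3600:   # warm: park in the CXL pool
        return "CXL"
    return "FLASH"            # cold: persist via GPU-direct flash

sessions = {"alice": 2.0, "bob": 120.0, "carol": 86_400.0}
placement = {user: tier_for(idle) for user, idle in sessions.items()}
print(placement)  # {'alice': 'HBM', 'bob': 'CXL', 'carol': 'FLASH'}
```

Real orchestrators weigh reload cost and bandwidth pressure, not just idle time, but the tier boundaries are the same shape.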


The Math That Changes Everything

Memory Cost per Vera Rubin NVL72 Rack (Estimated)

  • HBM4: 20.7 TB, $10M-$16M (90-94% of cost)
  • CXL Memory: 50-100 TB, $250K-$1M (2-6% of cost)
  • GPU-Direct Flash: 100+ TB, $10K-$50K (<1% of cost)
  • Total memory per rack: $10M-$17M

HBM4 is <20% of total capacity but 90%+ of total cost. CXL and flash provide the capacity HBM can't afford.

Here’s the part nobody is talking about: the memory bill for a single AI rack may rival or exceed the GPU cost.

For a Vera Rubin NVL72:

  • HBM4: 20.7 TB at $500-800/GB = $10-16 million — and that’s just the memory that sits on the GPU
  • CXL: 50-100 TB at $5-10/GB = $250K-1M — the KV cache overflow layer
  • GPU-Direct Flash: 100+ TB at $0.10-0.50/GB = $10-50K — cold storage and checkpoints

Total memory cost per rack: $10-17 million. HBM4 is less than 20% of the total capacity but more than 90% of the cost.
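Summing the article's own estimates (capacity ranges times $/GB ranges, nothing from vendor price lists) reproduces the headline bill:

```python
# Per-rack memory bill from the article's own estimates.
TB = 1_000  # GB per TB (decimal)
tiers = {  # tier: (capacity TB low, TB high, $/GB low, $/GB high)
    "HBM4":  (20.7, 20.7, 500,  800),
    "CXL":   (50,   100,  5,    10),
    "Flash": (100,  100,  0.10, 0.50),
}
low  = sum(tb_lo * TB * p_lo for tb_lo, _, p_lo, _ in tiers.values())
high = sum(tb_hi * TB * p_hi for _, tb_hi, _, p_hi in tiers.values())

print(f"total: ${low/1e6:.1f}M - ${high/1e6:.1f}M")   # ≈ $10.6M - $17.6M
print(f"HBM share: {(20.7 * TB * 800) / high:.0%}")   # ~94% at the high end
```

HBM4 dominates the bill at either end of the range, which is exactly why the lower tiers exist.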

This asymmetry explains everything:

  • Why Samsung is tripling HBM output (the margins are extraordinary)
  • Why CXL matters (it provides 5-10x more capacity for 1% of the cost)
  • Why NVIDIA created Storage-Next (flash is 1,000x cheaper than HBM per GB)
  • Why GPU cluster costs are up 30% from non-NVIDIA factors (memory is a huge driver)

HBM4 = 20% of capacity, 90% of cost. The entire CXL + flash infrastructure exists to avoid putting another dollar into HBM.

What This Means

The AI memory hierarchy is no longer GPU → DRAM → SSD. It’s:

SRAM (speed) → HBM (compute) → CXL (capacity) → HBF (bridge) → GPU-Direct Flash (depth)

Each tier has specific products shipping in 2026. Each tier addresses a different constraint. And the software layer — Dynamo — ties them all together.

For memory suppliers, this is the biggest opportunity since the smartphone era. Samsung, SK Hynix, and Micron aren’t just selling HBM — they’re selling into every tier of a five-layer stack. Marvell is selling the switches. Kioxia is selling the flash. Penguin is selling the CXL servers.

For AI labs, this changes procurement from “how many GPUs?” to “what’s the memory architecture?” The rack configuration for training (HBM-heavy, bandwidth-optimized) will look fundamentally different from inference (CXL-heavy, capacity-optimized).

The bottleneck was never compute. It was always memory. And now every layer of memory is being rebuilt at once.

Sources

  1. Vera Rubin NVL72: 288 GB HBM4 per GPU, 20.5 TB/s bandwidth, 1,580 TB/s aggregate across 72 GPUs. (Source: NVIDIA, surfaced H2 2026)
  2. Marvell Structera S 30260: 260-lane CXL switch enabling rack-level memory pooling. (Source: Marvell, surfaced Mar 2026)
  3. Kioxia GP Series: GPU-direct flash as HBM expansion via NVIDIA Storage-Next, 512-byte granularity. (Source: HPCwire, surfaced Mar 2026)
