DeepSeek's Memory Divorce: What Happens When AI Learns to Separate Knowing from Thinking
Your brain doesn’t solve a math problem the same way it remembers your mother’s birthday. One requires thinking. The other requires looking it up.
AI models don’t make that distinction. Every fact, every memory, every piece of trivia gets processed through the same trillion-dollar neural computation pipeline as every logical deduction. It’s like using a supercomputer to check your contacts list.
DeepSeek just published a paper that separates the two. And the implications for the $200 billion AI hardware industry are enormous.
The Most Expensive Way to Remember a Fact
Here’s a dirty secret about large language models: a huge percentage of their parameters exist purely to memorize facts. The capital of France. The boiling point of water. The release date of the iPhone 15. These aren’t things that require reasoning. They’re things that require lookup.
But in a standard Transformer, there is no lookup. Every piece of knowledge is encoded as patterns across billions of neural weights, processed through expensive matrix multiplications on GPUs, and stored in $20-30-per-gigabyte High Bandwidth Memory (HBM). You’re paying for supercomputer-grade computation to do what a dictionary does for free.
DeepSeek’s Engram paper asks a simple question: what if we just… didn’t do that?
The Third Axis of Sparsity
To understand why Engram matters, you need to understand what DeepSeek has been building. Over the past two years, they’ve introduced two architectural innovations that make their models radically more efficient:
Mixture-of-Experts (MoE) — Instead of activating all 685 billion parameters for every token, route each token to a handful of specialist sub-networks. Only 37 billion parameters fire per token. This is conditional compute: don’t think about everything at once.
Multi-head Latent Attention (MLA) — Compress the key-value cache (the model’s short-term memory during a conversation) into a low-dimensional latent space. This slashes the memory needed for long conversations by 5-10x. This is memory compression: remember more with less space.
Engram adds a third dimension: conditional memory. Instead of encoding facts in neural weights, store them in massive hash tables that can be looked up in O(1) time — a single step, regardless of table size. No matrix multiplication. No GPU cycles. Just a direct address into a memory bank.
Think of it like a university library. MoE is like having specialist professors who only show up for their subject. MLA is like compressing your lecture notes into shorthand. Engram is building an actual library — a massive, organized reference collection that anyone can walk into and grab a book off the shelf without bothering a professor.
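The core mechanic is easy to sketch. Below is a toy version of conditional memory — not DeepSeek's implementation, just the idea: hash a token n-gram to a slot in a large embedding table and read it back with one address computation, no matrix multiplication. The table size, dimensions, and token IDs are all invented for illustration.

```python
import numpy as np

# Toy sketch of conditional memory (not DeepSeek's implementation):
# hash a token n-gram to a slot in a large embedding table and read it
# back in O(1) -- one address computation, no matrix multiplication.

TABLE_SIZE = 1 << 16   # small here; the paper scales to billions of entries
EMBED_DIM = 64

rng = np.random.default_rng(0)
engram_table = rng.standard_normal((TABLE_SIZE, EMBED_DIM)).astype(np.float32)

def lookup(ngram: tuple[int, ...]) -> np.ndarray:
    """O(1) retrieval: hash the n-gram, address the table directly."""
    slot = hash(ngram) % TABLE_SIZE
    return engram_table[slot]

# The same n-gram always hits the same slot, so retrieval is deterministic.
vec = lookup((17, 404, 9))   # hypothetical token IDs for a memorized fact
```

Note the cost profile: whether the table holds a thousand entries or a hundred billion, a lookup is one hash and one memory read — which is exactly why it can live in cheap, slow-ish DRAM.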
The Results Are Startling
DeepSeek tested Engram against an identical-size MoE model at the same computational budget. Same number of parameters. Same FLOPs. The only difference: Engram replaces roughly 20% of MoE expert parameters with hash-based lookup tables.
Engram-27B vs MoE-27B at equal FLOPs. Source: DeepSeek Engram paper (arXiv:2601.07372)
The surprise isn’t just the knowledge gains (MMLU +3.0) — you’d expect that from a model with better memory. The surprise is that reasoning improves the most (BBH +5.0). Why?
Because when early Transformer layers no longer need to reconstruct facts from neural weights, those attention heads are freed up for something more valuable: actual reasoning. The model thinks better because it doesn’t waste brainpower remembering.
And it scales. The paper reveals a log-linear scaling law: model quality improves predictably as you add more entries to the Engram tables. More memory capacity = smarter model. That’s a new scaling axis that runs on cheap DRAM, not expensive GPU hours.
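What "log-linear" buys you is easy to see in a toy form: every multiplicative increase in table size adds the same fixed increment to predicted quality. The coefficients below are invented for illustration, not the paper's actual fit.

```python
import math

# Illustrative only: coefficients a and b are invented, not the paper's fit.
# "Log-linear" means every 10x in Engram table entries adds the same fixed
# increment b to predicted quality.
def predicted_quality(n_entries: float, a: float = 50.0, b: float = 2.0) -> float:
    return a + b * math.log10(n_entries)

# Going from 100M to 1B entries adds the same gain as 1B to 10B would.
gain_per_decade = predicted_quality(1e9) - predicted_quality(1e8)
```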
The 2% Miracle
Here’s the number that matters for the hardware industry.
DeepSeek offloaded 100 billion Engram parameters from the GPU to host DRAM — the regular system memory sitting on the motherboard, not the precious HBM strapped to the GPU. The throughput hit: just 2.0% on a 4-billion-parameter model (9,031 down to 8,858 tokens per second) and 2.8% on an 8-billion-parameter model.
How? Asynchronous PCIe prefetching. The model knows which hash table entries it will need a few steps ahead and requests them over the PCIe bus before they’re needed. By the time the compute pipeline is ready for the data, it’s already arrived. The GPU never waits.
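The overlap pattern itself is standard double-buffering, and a minimal sketch makes the timing argument concrete. Everything below is hypothetical stand-in code — `fetch_entries` plays the role of a DRAM-to-GPU copy over PCIe, `compute` the forward pass — not DeepSeek's pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Toy sketch of asynchronous prefetching (all names/timings hypothetical):
# while "compute" works on step t, a background thread fetches the table
# entries step t+1 will need, so compute never stalls on the transfer.

def fetch_entries(step: int) -> list[int]:
    """Stand-in for a DRAM -> GPU copy over the PCIe bus."""
    time.sleep(0.01)
    return [step * 10 + i for i in range(4)]

def compute(step: int, entries: list[int]) -> int:
    """Stand-in for the forward pass that consumes the fetched entries."""
    time.sleep(0.02)
    return sum(entries)

NUM_STEPS = 4
results = []
with ThreadPoolExecutor(max_workers=1) as io:
    pending = io.submit(fetch_entries, 0)        # kick off prefetch for step 0
    for step in range(NUM_STEPS):
        entries = pending.result()               # usually already finished
        if step + 1 < NUM_STEPS:
            pending = io.submit(fetch_entries, step + 1)  # overlap next fetch
        results.append(compute(step, entries))
```

As long as the fetch for step t+1 takes less time than the compute for step t (0.01s vs 0.02s here), the transfer cost disappears behind the compute — which is the paper's 2% figure in miniature.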
This is architecturally profound. It means:
- Knowledge lives in $3-5/GB DDR5 instead of $20-30/GB HBM4
- GPU memory is freed for what it’s actually good at: compute and KV cache
- Quality scales with DRAM capacity, not GPU count
Engram’s insight: static knowledge doesn’t need HBM bandwidth. It needs capacity. Move it to DDR5/CXL at 4-6x lower cost per GB.
The 80/20 Rule of Thinking vs. Knowing
The paper uncovers something elegant: a U-shaped scaling law that governs the optimal split between neural computation and static memory.
Replace too little of your model with Engram tables (less than ~20%), and you’re wasting GPU FLOPs on memorization. Replace too much (more than ~20%), and you don’t have enough neural capacity for reasoning. The sweet spot is right around 80% compute, 20% memory.
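One way to build intuition for the U-shape is a toy cost model — entirely invented, not the paper's formulation: weight a recall term (eased by memory) against a reasoning term (eased by compute) and the loss minimum lands exactly at the recall weight.

```python
import math

# Toy model of the U-shape -- functional form and weights are invented,
# not taken from the paper. Give recall 20% of the importance and
# reasoning 80%; the loss is then minimized at a 20% memory fraction.
def toy_loss(memory_frac: float) -> float:
    recall_cost = -0.2 * math.log(memory_frac)           # too little memory hurts
    reasoning_cost = -0.8 * math.log(1.0 - memory_frac)  # too much memory hurts
    return recall_cost + reasoning_cost

# Sweep memory fractions from 1% to 99% and find the minimum.
best = min((f / 100 for f in range(1, 100)), key=toy_loss)
```

The point of the toy is only that a U-shape falls out of any budget split between two competing log-scaling demands; where the real sweet spot sits is an empirical finding of the paper.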
This isn’t just a hyperparameter. It’s a design principle for the next generation of AI hardware. If 20% of a frontier model’s intelligence comes from pure memory lookup, hardware architects should be designing systems with heterogeneous memory pools — fast, expensive HBM for the 80% that thinks, and vast, cheap DRAM or CXL memory for the 20% that knows.
The 80/20 split is intriguingly close to some rough neuroscience estimates that put semantic memory (facts and knowledge) at around a fifth of cortical resources, with the rest handling working memory, reasoning, and sensory processing. The parallel is loose, but DeepSeek may have stumbled onto a fundamental constant of intelligence architecture.
The Retrieval Revolution
Long-context performance is where Engram truly shines. On the RULER benchmark — the gold standard for testing whether a model can actually find and use information buried deep in a long conversation — Engram scores:
- 97.0 vs 84.2 on multi-query needle-in-haystack (finding specific facts)
- 89.0 vs 77.0 on variable tracking (following changing values across context)
That’s a 12-15 point accuracy gap on the hardest retrieval tasks. The reason is intuitive: when early layers don’t need to waste attention heads reconstructing memorized facts, those heads are available for actually tracking information across the context window. The model has better “working memory” because its “long-term memory” is handled by hardware instead of neurons.
The DRAM Squeeze Gets Worse
Now here’s where it gets scary for the memory industry — or incredibly bullish, depending on which stock you hold.
The DRAM market is already in crisis. Prices are 7-8x higher than a year ago. Samsung, SK Hynix, and Micron won’t have substantial new manufacturing capacity online until mid-2027. And HBM production is actively cannibalizing DRAM wafers — every HBM4 chip requires 3-4x the silicon area of standard DDR5 per gigabyte.
Now add Engram to the equation. If frontier models adopt 500 billion to 1 trillion Engram parameters, each inference server needs 250-500 GB of additional DDR5 per GPU just for the knowledge tables. Across a 72-GPU Vera Rubin rack, that’s 18-36 TB of extra DDR5 — doubling or tripling current host memory requirements.
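The arithmetic behind those rack-level numbers is worth spelling out. The 0.5-bytes-per-parameter storage figure below is an assumption inferred from the quoted totals (consistent with, e.g., 4-bit quantized table entries); the article doesn't state a precision.

```python
# Back-of-envelope check of the rack-level numbers above. Assumptions:
# one full copy of the Engram tables in host DDR5 per GPU, stored at
# ~0.5 bytes per parameter (inferred from the quoted totals, e.g. 4-bit
# quantized entries -- not stated in the source).
BYTES_PER_PARAM = 0.5
GPUS_PER_RACK = 72

def ddr5_gb_per_gpu(engram_params_billion: float) -> float:
    # 1e9 params * bytes/param -> gigabytes of host memory per GPU
    return engram_params_billion * BYTES_PER_PARAM

low_gb, high_gb = ddr5_gb_per_gpu(500), ddr5_gb_per_gpu(1000)  # 250 / 500 GB
rack_tb = (low_gb * GPUS_PER_RACK / 1000, high_gb * GPUS_PER_RACK / 1000)
```

Sharding the tables across GPUs rather than replicating them would cut the per-rack total, at the cost of cross-GPU lookups — which is exactly the kind of trade-off CXL-attached memory pools are meant to address.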
The implication is clear. Efficiency breakthroughs like Engram and TurboQuant don’t reduce memory demand — they diversify it. HBM pressure eases slightly as knowledge moves to DRAM, but total silicon demand increases because you’re now using both tiers simultaneously rather than cramming everything into one.
Building Blocks for V4
Engram didn’t arrive alone. DeepSeek has been methodically open-sourcing an entire infrastructure stack that, taken together, looks like the foundation for their next-generation model:
mHC (Manifold-Constrained Hyper-Connections) — A companion paper that solves training instability at massive scale. Standard hyper-connections caused signal amplification exceeding 3000x at 27B parameters, leading to catastrophic divergence. mHC constrains the math to prevent this, with only 6.7% training overhead. This is the scaffolding that lets you train very deep networks with Engram layers.
3FS (Fire-Flyer File System) — A distributed file system delivering 6.6 TiB/s read throughput across 180+ storage nodes. Built for the data-hungry training runs that Engram-scale models demand.
DeepEP — An expert-parallel communication library that achieved 1.3-1.5x throughput gains for MoE models on AMD MI300X clusters. Optimizes the cross-GPU data movement that MoE architectures depend on.
FlashMLA — An optimized decoding kernel for Multi-head Latent Attention on NVIDIA Hopper GPUs. Makes the MLA compression fast enough for production inference.
Together, these four projects make the full MoE + MLA + Engram architecture reproducible outside DeepSeek. They’re not just publishing papers — they’re shipping tools.
This open-source strategy is deliberate. By commoditizing the infrastructure layer, DeepSeek ensures the broader ecosystem adopts their architectural choices — which in turn validates their hardware optimization decisions and creates demand for the kind of heterogeneous memory systems their models are designed to exploit.
What to Watch
The Engram paper tests models up to 27 billion parameters. DeepSeek V4 is rumored at 1 trillion. Does the U-shaped sparsity law hold at frontier scale? Does PCIe bandwidth become the bottleneck at 500B+ Engram parameters, making CXL memory mandatory? Will other labs — Alibaba’s Qwen team, Meta, Mistral — adopt the three-axis framework?
These are the questions that will determine whether Engram is an interesting research paper or a structural shift in how the industry builds AI infrastructure.
One thing is already clear: the assumption that AI models get better only by adding more compute is broken. DeepSeek just showed that adding more memory — cheap, commodity, boring old DRAM — makes models smarter too. And in a world where DRAM prices are already up 7-8x year-over-year, that’s a finding with very real financial consequences.
The memory wall didn’t go away. It just got a second front.