
Jevons Paradox: Why Every AI Optimization Makes the Hardware Shortage Worse

Tags: Jevons Paradox, Inference, TurboQuant, Engram, Memory, HBM, DRAM, Supply Chain, OpenRouter

In 1865, the economist William Stanley Jevons observed that James Watt’s more efficient steam engine did not reduce coal consumption. It increased it. The efficiency improvement made steam power economical for new applications — factories, railways, ships — and total coal demand exploded.

One hundred and sixty-one years later, the same mechanism is playing out in AI inference. And we now have the data to prove it.


The Evidence: OpenRouter’s 100-Trillion-Token Dataset

OpenRouter published a landmark study in January 2026 analyzing over 100 trillion tokens processed across its platform. The category breakdown tells the story:

OpenRouter Token Usage by Category (2025-2026)

  Period       Coding/Programming   All Other Categories
  Early 2025   11%                  89%
  Mid 2025     25%                  75%
  Late 2025    38%                  62%
  Mar 2026     52%                  48%

Source: OpenRouter State of AI study (arXiv:2601.10088). The 11% and 50%+ endpoints are sourced; mid-points are interpolated. Category tags were added mid-2025.

Programming went from roughly 11% of total token volume in early 2025 to over 50% by March 2026. Agent-driven workflow tokens now exceed half of all platform output. (One caveat: OpenRouter added category tags mid-2025, so the early figure is retroactive classification — but the direction and magnitude of the shift are unmistakable.)

This didn’t happen because existing programmers started using more tokens. It happened because an entirely new use case — agentic coding — was created by cheaper, faster inference. Tools like Claude Code, Cursor, and Codex turned programming from a human activity into a human-directed AI activity. Each agentic coding session consumes orders of magnitude more tokens than the chat conversations it displaced — SWE-bench data shows agents consuming 100K-2M tokens per bug fix.

OpenRouter data: programming tokens grew from 11% to 50%+ in one year. Agent-driven workflows now exceed half of total platform output. The usage category barely existed 18 months ago.

Claude on OpenRouter is over 80% programming workloads. Not chat. Not creative writing. Code.


The Mechanism: Cheaper Tokens Create New Markets

The pattern is consistent across every efficiency improvement in the AI stack:

Frontier Model Pricing vs. Global Token Demand
GPT-4
Mar 2023
$60
per 1M output tokens
0.5T
est. daily global tokens
GPT-4 Turbo
Nov 2023
$30
per 1M output tokens
1.2T
est. daily global tokens
Claude 3 Opus
Mar 2024
$75
per 1M output tokens
3T
est. daily global tokens
GPT-4o
May 2024
$15
per 1M output tokens
8T
est. daily global tokens
DeepSeek V3
Jan 2025
$0.28
per 1M output tokens
18T
est. daily global tokens
DeepSeek V3.1
Mar 2026
$0.75
per 1M output tokens
45T
est. daily global tokens
Prices: frontier model output token pricing at launch (official API rates). Volume: industry estimates from disclosed data. As prices fall 200x, demand grows 90x.

Frontier output token prices dropped from $60/million (GPT-4 launch, March 2023) to $0.28/million (DeepSeek V3, January 2025). A 200x price reduction in under two years. Since then, estimated global token consumption has grown from under a trillion tokens per day to 30-50 trillion.

The price elasticity of inference demand is enormous. Every 10x drop in cost unlocks a new class of application:

  • $60/M tokens: Enterprise early adopters running experimental queries
  • $15/M tokens: Developers using AI assistants in daily workflows
  • $0.75/M tokens: Autonomous coding agents running multi-step workflows
  • $0.10/M tokens: Every CI pipeline, every code review, every test suite augmented with AI

We are at the $0.28-0.75 tier now. The applications being built for $0.10 — continuous agent loops, real-time code monitoring, automated security auditing — will consume orders of magnitude more tokens than everything that came before.
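
To make those tiers concrete, here is a back-of-envelope sketch of what one long agent session costs at each price point. It is our own illustration; the 1M-token session size is an assumption borrowed from the Claude Code default discussed below:

```python
# Back-of-envelope: cost of a single long agentic session at each price tier.
# The 1M-token session size is an illustrative assumption.

PRICE_TIERS = [60.0, 15.0, 0.75, 0.10]  # $ per 1M output tokens
SESSION_TOKENS = 1_000_000              # one long agent session

for price in PRICE_TIERS:
    cost = SESSION_TOKENS / 1e6 * price
    print(f"${price:>5.2f}/M tokens -> ${cost:>5.2f} per session")

# $60.00/M -> $60.00 per session: viable only for high-value experiments
# $ 0.10/M -> $ 0.10 per session: cheap enough to attach to every CI run
```

The same workload moves from a budget line item to a rounding error, and that transition is what creates entirely new markets.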

Anthropic just made this explicit. Claude Code on Max/Team/Enterprise now defaults to a 1M token context window — 5x the previous 200K limit — at standard pricing with no per-token surcharge. A single agentic coding session can now consume a million tokens. That’s not a theoretical maximum; it’s the default. Multiply by the number of developers using Claude Code daily, and you begin to see why efficiency improvements create demand rather than reduce it.
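
A rough sizing exercise shows why the default matters. Every number below is a hypothetical input, not a disclosed figure:

```python
# Hypothetical sizing of the demand implied by the 1M-token default.
# All inputs are invented for illustration; none are disclosed figures.

daily_active_devs = 500_000        # assumed Claude Code daily users
sessions_per_dev_per_day = 3       # assumed agent sessions per developer
tokens_per_session = 1_000_000     # the new default context, fully used

daily_tokens = daily_active_devs * sessions_per_dev_per_day * tokens_per_session
print(f"{daily_tokens / 1e12:.1f}T tokens/day from one product")  # 1.5T tokens/day
```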

Cerebras demonstrated the same dynamic from the hardware side: their partnership with AWS runs Trainium for prefill and CS-3 for decode, achieving 1,200 tokens per second. A 10-step agent chain completes in under 3 seconds at that speed. At 50 tok/s (standard GPU inference), the same chain takes over 30 seconds. That 24x speedup doesn’t just make existing workloads faster — it makes multi-step agentic workflows practical for the first time. New demand, created by efficiency.
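
The arithmetic behind that threshold is simple. A minimal sketch, assuming roughly 300 output tokens per agent step (our assumption; the sources do not break down per-step token counts):

```python
# Decode latency of a sequential N-step agent chain.
# Assumes ~300 output tokens per step; ignores prefill and network time.

def chain_latency_s(steps: int, tokens_per_step: int, tokens_per_s: float) -> float:
    return steps * tokens_per_step / tokens_per_s

for speed in (1200, 50):  # Cerebras-class decode vs. standard GPU decode
    t = chain_latency_s(steps=10, tokens_per_step=300, tokens_per_s=speed)
    print(f"{speed:>4} tok/s -> {t:4.1f} s for a 10-step chain")

# 1200 tok/s ->  2.5 s  (feels interactive)
#   50 tok/s -> 60.0 s  (nobody waits a minute per loop iteration)
```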


The Memory Paradox: TurboQuant and Engram Grow Demand

This is where the Jevons mechanism gets specific to semiconductors, and where the market got the TurboQuant selloff completely wrong.

TurboQuant compresses KV cache by 4-6x. The naive interpretation: GPUs need less HBM. The actual outcome: operators fill the freed memory with more concurrent users, keeping total HBM utilization flat while serving more revenue per GPU.
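
A minimal sketch of that trade, assuming a Llama-70B-class model shape and 16-bit KV values (our assumptions; TurboQuant's actual quantization scheme is not modeled here):

```python
# KV-cache sizing: why 4x compression becomes 4x concurrency, not idle HBM.
# Model shape is a Llama-70B-class assumption; TurboQuant internals not modeled.

def kv_cache_gb(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val / 1e9  # 2x for K and V

hbm_free_gb = 40.0                       # HBM left over after weights (assumption)
per_user = kv_cache_gb(seq_len=32_768)   # ~10.7 GB per 32K-context user

print(f"uncompressed:  {hbm_free_gb / per_user:.0f} concurrent users")        # 4
print(f"4x compressed: {hbm_free_gb / (per_user / 4):.0f} concurrent users")  # 15
# Same 40 GB of HBM either way; the operator just serves ~4x the users.
```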

SanDisk CFO David Visoso explicitly named this: efficiency improvements raise the ROI of hyperscale capex, which drives more investment. AMD VP Mario Morales said the same at SEMICON China — AI efficiency fuels demand growth under Jevons paradox.

DeepSeek Engram goes further. It replaces dense model layers with static hash lookup tables that don’t need HBM bandwidth — they need cheap, high-capacity DDR5 or LPDDR5. So Engram doesn’t reduce memory demand. It shifts it from expensive HBM to cheaper DRAM, while the freed HBM gets consumed by larger models or more concurrency. We covered the full implications of this architectural shift in DeepSeek’s Memory Divorce.

Per-GPU Memory Deployment Under Efficiency Gains (Illustrative)

  Scenario                                       HBM4    DDR5/LPDDR5   Total
  Baseline (pre-TurboQuant, pre-Engram)          80 GB   0             80 GB
  TurboQuant (KV cache 4x compressed)            80 GB   0             80 GB (same HBM, 4x more concurrent users)
  Engram (knowledge offloaded to DRAM)           60 GB   120 GB        180 GB
  Both + Jevons (demand fills freed capacity)    80 GB   200 GB        280 GB

Illustrative per-GPU memory deployment. TurboQuant frees KV-cache headroom, but operators fill it with concurrency. Engram shifts knowledge to DDR5. Net: more total GB deployed.

The result: more total gigabytes deployed per server, not fewer. HBM wafer consumption stays high (Jevons). DDR5 demand increases (new tier). Memory manufacturers sell more units across more product lines. AMD’s MI455X was accidentally designed for exactly this three-tier workload: weights on HBM4, compressed KV cache in the freed HBM headroom, and Engram tables on LPDDR5.
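
To see why this mix shift grows rather than shrinks the memory bill, here is a rough sketch of per-GPU spend in each scenario from the table above. The $/GB figures are loose assumptions for scale, not quoted contract prices:

```python
# Illustrative per-GPU memory spend for each scenario in the table above.
# $/GB inputs are rough assumptions, not quoted contract prices.

HBM_USD_PER_GB, DDR5_USD_PER_GB = 15.0, 5.0

scenarios = {                 # (HBM GB, DDR5/LPDDR5 GB)
    "Baseline":      (80, 0),
    "TurboQuant":    (80, 0),
    "Engram":        (60, 120),
    "Both + Jevons": (80, 200),
}
for name, (hbm, ddr5) in scenarios.items():
    spend = hbm * HBM_USD_PER_GB + ddr5 * DDR5_USD_PER_GB
    print(f"{name:<14} {hbm + ddr5:>3} GB -> ${spend:,.0f} of memory per GPU")
# Baseline $1,200 -> equilibrium $2,200: more dollars to memory makers, not fewer.
```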

TrendForce revised its Q1 2026 forecast for server DRAM pricing from +60-65% to +93-98% QoQ, and for NAND from +33-38% to +85-90%. These are not the numbers of an industry facing demand destruction.

SK Hynix’s response to TurboQuant was not to cut capacity. It was to file for a US IPO targeting $10-14 billion, specifically to fund HBM expansion, and to place a $79 billion ASML EUV equipment order.


The Gross Margin Trap

If Jevons paradox drives infinite demand growth, why are AI inference companies losing money?

Anthropic’s gross margins were reportedly -94% in 2024, according to SemiAnalysis. MiniMax reported -25%. These are companies whose entire business is serving inference tokens.

The paradox resolves when you separate aggregate demand from unit economics. Total demand is exploding. But each unit of demand pays less than it costs to serve at the frontier. The companies that benefit from Jevons paradox are not the ones selling tokens — they’re the ones selling the picks and shovels: GPUs, HBM, substrates, helium, and electrical power.

This is why NVIDIA’s margins are expanding while Anthropic’s are deeply negative, and why memory manufacturers have pricing power despite every efficiency paper published. The efficiency gains lower the price of inference, which creates demand, which requires more hardware, which increases hardware prices, which makes the inference providers’ unit economics worse even as their revenue grows.
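
The mechanism is easy to state numerically. A toy model with invented numbers, deliberately not any company's actual figures:

```python
# Toy model: aggregate demand explodes while unit economics stay underwater.
# Price and cost are invented numbers, not any company's actual figures.

price_per_m_tokens = 0.75   # $ charged per 1M tokens
cost_per_m_tokens = 1.10    # $ to serve 1M tokens at the frontier (assumed)

for daily_tokens_t in (1, 10, 100):         # trillions of tokens per day
    millions = daily_tokens_t * 1_000_000   # 1T tokens = 1e6 million-token units
    revenue, cost = millions * price_per_m_tokens, millions * cost_per_m_tokens
    print(f"{daily_tokens_t:>3}T/day: revenue ${revenue/1e6:6.2f}M, "
          f"gross margin {(revenue - cost) / revenue:.0%}")
# Revenue grows 100x; gross margin is -47% at every scale.
```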

It’s Jevons all the way down.


What Breaks the Cycle?

Three things could slow the Jevons flywheel:

  1. Token demand saturates. Every useful task that could be done by an AI agent is being done by one. Given that programming alone went from 11% to 50% in a year and most industries haven’t even started adopting agentic workflows, saturation is years away.

  2. Hardware supply catches up. Possible in theory, but 2026 has 7 simultaneous supply constraints across orthogonal axes (helium, tungsten, PCB materials, HBM wafers, inspection equipment, water, energy). We wrote about this separately. For the full picture on how HBM yield economics connect NVIDIA and AMD to the same Samsung wafer, see our HBM4 Yield Game analysis.

  3. Regulation caps compute deployment. The Sanders/AOC data center moratorium, 100+ local moratoriums, and growing public backlash (200 protesters at Anthropic’s HQ this week) suggest this is the most plausible constraint. If you can’t build data centers, you can’t deploy more GPUs, regardless of demand.

The history of Jevons paradox in coal, oil, and electricity suggests that efficiency-driven demand growth continues until either the resource is fully substituted (unlikely for silicon) or regulation intervenes (increasingly likely). The question for the AI industry is not whether demand destruction will happen from efficiency — it won’t. The question is whether the physical infrastructure can be built fast enough to meet the demand that efficiency creates.

The Investment Implication

This is analysis, not financial advice. We hold positions in some names mentioned. Do your own research.

If you believe Jevons paradox applies to AI inference — and the OpenRouter data strongly suggests it does — then:

  • Every efficiency paper is bullish for hardware. TurboQuant, Engram, MoE architectures, speculative decoding, distillation — all of these lower the cost of a token, which increases demand for tokens, which increases demand for the silicon that serves them.

  • Memory pricing power persists. The TurboQuant selloff was a gift. Server DRAM pricing is up 93-98% QoQ and accelerating. SK Hynix is raising up to $14 billion for capacity expansion. The market is buying, not selling.

  • The real risk is supply constraints, not demand destruction. The semiconductor supply chain is already struggling with converging bottlenecks. The demand side is a solved problem. The question is whether we can build fast enough.

  • Something you might not know: Micron’s own data shows DDR5 is currently more profitable than HBM. If Engram-style architectures shift the memory mix from HBM toward DDR5, that’s not margin compression for memory manufacturers — it’s margin expansion on higher volume. The “HBM demand destruction” narrative isn’t just wrong about demand. It’s wrong about margins too.

We estimated in our Token Tsunami analysis that the world is serving 30-50 trillion tokens per day with capacity set to grow 10-20x by year-end. The Jevons mechanism explains why that capacity will get filled, not sit idle.

William Stanley Jevons figured this out in 1865. The AI market is learning the same lesson at $2.59 per GPU-hour.

Sources:

1. Anthropic made 1M-token context generally available for Claude Opus 4.6 and Sonnet 4.6 at standard pricing ($5/$25 per million tokens), with Claude Code on Max/Team/Enterprise defaulting to 1M context automatically. (Source: Anthropic, surfaced Mar 2026)

2. TurboQuant on MLX achieves a 75% reduction in memory usage. (Source: @mweinbach, surfaced Mar 2026)

3. A 10-step agent chain powered by GPT-Codex-5.3-Spark on Cerebras completes in under 3 seconds at 1,200 tokens per second. (Source: @zephyr_z9, surfaced Mar 2026)
