Jevons Paradox: Why Every AI Optimization Makes the Hardware Shortage Worse
In 1865, the economist William Stanley Jevons observed that James Watt’s more efficient steam engine did not reduce coal consumption. It increased it. The efficiency improvement made steam power economical for new applications — factories, railways, ships — and total coal demand exploded.
One hundred and sixty-one years later, the same mechanism is playing out in AI inference. And we now have the data to prove it.
The Evidence: OpenRouter’s 100-Trillion-Token Dataset
OpenRouter published a landmark study in January 2026 analyzing over 100 trillion tokens processed across its platform. The category breakdown tells the story:
Programming went from roughly 11% of total token volume in early 2025 to over 50% by March 2026. Agent-driven workflow tokens now exceed half of all platform output. (One caveat: OpenRouter added category tags mid-2025, so the early figure is retroactive classification — but the direction and magnitude of the shift are unmistakable.)
This didn’t happen because existing programmers started using more tokens. It happened because an entirely new use case — agentic coding — was created by cheaper, faster inference. Tools like Claude Code, Cursor, and Codex turned programming from a human activity into a human-directed AI activity. Each agentic coding session consumes orders of magnitude more tokens than the chat conversations it displaced — SWE-bench data shows agents consuming 100K-2M tokens per bug fix.
Claude on OpenRouter is over 80% programming workloads. Not chat. Not creative writing. Code.
The Mechanism: Cheaper Tokens Create New Markets
The pattern is consistent across every efficiency improvement in the AI stack:
Frontier output token prices dropped from $60/million (GPT-4 launch, March 2023) to $0.28/million (DeepSeek V3, January 2025). A 200x price reduction in under two years. Over the same period, estimated global token consumption grew from under a trillion per day to 30-50 trillion.
The price elasticity of inference demand is enormous. Every 10x drop in cost unlocks a new class of application:
- $60/M tokens: Enterprise early adopters running experimental queries
- $15/M tokens: Developers using AI assistants in daily workflows
- $0.75/M tokens: Autonomous coding agents running multi-step workflows
- $0.10/M tokens: Every CI pipeline, every code review, every test suite augmented with AI
We are at the $0.28-0.75 tier now. The applications being built for $0.10 — continuous agent loops, real-time code monitoring, automated security auditing — will consume orders of magnitude more tokens than everything that came before.
Anthropic just made this explicit. Claude Code on Max/Team/Enterprise now defaults to a 1M token context window — 5x the previous 200K limit — at standard pricing with no per-token surcharge. A single agentic coding session can now consume a million tokens. That’s not a theoretical maximum; it’s the default. Multiply by the number of developers using Claude Code daily, and you begin to see why efficiency improvements create demand rather than reduce it.
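The multiplication above is worth making concrete. A quick sketch, where every input is a hypothetical placeholder (Anthropic does not publish daily-active or per-session figures) and only the 1M-token ceiling comes from the announcement:

```python
# Back-of-the-envelope: how default 1M-token sessions translate into
# aggregate token demand. All inputs below are hypothetical placeholders
# except the 1M context ceiling.

daily_developers = 100_000      # hypothetical Claude Code daily actives
sessions_per_dev = 4            # hypothetical agentic sessions per day
tokens_per_session = 1_000_000  # the new default context ceiling

tokens_per_day = daily_developers * sessions_per_dev * tokens_per_session
print(f"{tokens_per_day / 1e12:.1f}T tokens/day from one product")
# 100K devs x 4 sessions x 1M tokens = 0.4T tokens/day
```

Even with these deliberately modest assumptions, a single product line generates a meaningful fraction of a trillion tokens per day — which is the Jevons mechanism in one line of arithmetic.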
Cerebras demonstrated the same dynamic from the hardware side: their partnership with AWS runs Trainium for prefill and CS-3 for decode, achieving 1,200 tokens per second. A 10-step agent chain completes in under 3 seconds at that speed. At 50 tok/s (standard GPU inference), the same chain takes over 30 seconds. That 24x speedup doesn’t just make existing workloads faster — it makes multi-step agentic workflows practical for the first time. New demand, created by efficiency.
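The latency math behind that claim is simple because agent chains are serial: step N+1 can't start until step N finishes decoding. A sketch, assuming roughly 200 output tokens per step (an illustrative figure, not from the Cerebras announcement):

```python
# Serial agent chain: latency scales linearly with 1/throughput because
# each step must finish decoding before the next begins.
# 200 output tokens per step is an assumed figure, not a published one.

steps = 10
tokens_per_step = 200  # assumption

def chain_latency_s(tok_per_s: float) -> float:
    return steps * tokens_per_step / tok_per_s

fast = chain_latency_s(1200)  # CS-3-class decode throughput
slow = chain_latency_s(50)    # typical GPU inference throughput

print(f"1200 tok/s: {fast:.1f}s | 50 tok/s: {slow:.0f}s | {slow / fast:.0f}x")
# ~1.7s vs 40s -- one side of the interactivity threshold vs the other
```

Under these assumptions the chain lands at roughly 1.7 seconds versus 40 seconds. The point isn't the exact numbers; it's that a 24x throughput gain moves a serial workflow across the threshold where humans will actually sit and wait for it.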
The Memory Paradox: TurboQuant and Engram Grow Demand
This is where the Jevons mechanism gets specific to semiconductors, and where the market got the TurboQuant selloff completely wrong.
TurboQuant compresses KV cache by 4-6x. The naive interpretation: GPUs need less HBM. The actual outcome: operators fill the freed memory with more concurrent users, keeping total HBM utilization flat while serving more revenue per GPU.
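The operator's decision can be sketched in a few lines. All capacity figures below are illustrative assumptions (no specific GPU or model is implied); only the 4x compression ratio comes from the article:

```python
# Why KV-cache compression raises GPU revenue instead of cutting HBM
# demand: operators refill the freed memory with more concurrent users.
# All gigabyte figures here are illustrative assumptions.

hbm_gb = 192          # total accelerator HBM (illustrative)
weights_gb = 120      # resident model weights (illustrative)
kv_per_user_gb = 9.0  # uncompressed KV cache per concurrent user (illustrative)

free_gb = hbm_gb - weights_gb
users_before = int(free_gb // kv_per_user_gb)       # uncompressed KV cache
users_after = int(free_gb // (kv_per_user_gb / 4))  # TurboQuant's 4x ratio

print(f"concurrent users per GPU: {users_before} -> {users_after}")
# HBM is full in both cases; only revenue per GPU changes
```

Note what stays constant: the HBM is fully utilized before and after. Compression changed the revenue per gigabyte, not the gigabytes demanded.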
SanDisk CFO David Visoso explicitly named this: efficiency improvements raise the ROI of hyperscale capex, which drives more investment. AMD VP Mario Morales said the same at SEMICON China — AI efficiency fuels demand growth under Jevons paradox.
DeepSeek Engram goes further. It replaces dense model layers with static hash lookup tables that don’t need HBM bandwidth — they need cheap, high-capacity DDR5 or LPDDR5. So Engram doesn’t reduce memory demand. It shifts it from expensive HBM to cheaper DRAM, while the freed HBM gets consumed by larger models or more concurrency. We covered the full implications of this architectural shift in DeepSeek’s Memory Divorce.
The result: more total gigabytes deployed per server, not fewer. HBM wafer consumption stays high (Jevons). DDR5 demand increases (new tier). Memory manufacturers sell more units across more product lines. AMD’s MI455X was accidentally designed for exactly this three-tier workload: weights on HBM4, compressed KV cache, Engram tables on LPDDR5.
SK Hynix’s response to TurboQuant was not to cut capacity. It was to file for a US IPO targeting $10-14 billion, specifically to fund HBM expansion, and to place a $79 billion ASML EUV equipment order.
The Gross Margin Trap
If Jevons paradox drives infinite demand growth, why are AI inference companies losing money?
Anthropic’s gross margins were reportedly -94% in 2024, according to SemiAnalysis. MiniMax reported -25%. These are companies whose entire business is serving inference tokens.
The paradox resolves when you separate aggregate demand from unit economics. Total demand is exploding. But each unit of demand pays less than it costs to serve at the frontier. The companies that benefit from Jevons paradox are not the ones selling tokens — they’re the ones selling the picks and shovels: GPUs, HBM, substrates, helium, and electrical power.
This is why NVIDIA’s margins are expanding while Anthropic’s are deeply negative, and why memory manufacturers have pricing power despite every efficiency paper published. The efficiency gains lower the price of inference, which creates demand, which requires more hardware, which increases hardware prices, which makes the inference providers’ unit economics worse even as their revenue grows.
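The reported margins translate directly into cost per dollar of revenue. A quick check, using the figures cited above (the NVIDIA margin is a rough illustrative value, not from this article):

```python
# What a reported gross margin implies about serving cost:
#   margin = (revenue - cost) / revenue  =>  cost = revenue * (1 - margin)

def cost_per_revenue_dollar(gross_margin: float) -> float:
    return 1.0 - gross_margin

anthropic = cost_per_revenue_dollar(-0.94)  # reported -94%
minimax = cost_per_revenue_dollar(-0.25)    # reported -25%
nvidia = cost_per_revenue_dollar(0.75)      # ~75%, illustrative

print(f"Anthropic: ${anthropic:.2f} of cost per $1 of token revenue")
print(f"MiniMax:   ${minimax:.2f} of cost per $1 of token revenue")
print(f"NVIDIA:    ${nvidia:.2f} of cost per $1 of GPU revenue")
```

A -94% gross margin means roughly $1.94 of serving cost for every $1 of token revenue — so growing revenue under Jevons-driven demand grows the loss in lockstep, while the hardware seller on the other side of the same transaction books 75 cents of gross profit.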
It’s Jevons all the way down.
What Breaks the Cycle?
Three things could slow the Jevons flywheel:
- Token demand saturates. Every useful task that could be done by an AI agent is being done by one. Given that programming alone went from 11% to 50% in a year and most industries haven’t even started adopting agentic workflows, saturation is years away.
- Hardware supply catches up. Possible in theory, but 2026 has 7 simultaneous supply constraints across orthogonal axes (helium, tungsten, PCB materials, HBM wafers, inspection equipment, water, energy). We wrote about this separately. For the full picture on how HBM yield economics connect NVIDIA and AMD to the same Samsung wafer, see our HBM4 Yield Game analysis.
- Regulation caps compute deployment. The Sanders/AOC data center moratorium, 100+ local moratoriums, and growing public backlash (200 protesters at Anthropic’s HQ this week) suggest this is the most plausible constraint. If you can’t build data centers, you can’t deploy more GPUs, regardless of demand.
The Investment Implication
This is analysis, not financial advice. We hold positions in some names mentioned. Do your own research.
If you believe Jevons paradox applies to AI inference — and the OpenRouter data strongly suggests it does — then:
- Every efficiency paper is bullish for hardware. TurboQuant, Engram, MoE architectures, speculative decoding, distillation — all of these lower the cost of a token, which increases demand for tokens, which increases demand for the silicon that serves them.
- Memory pricing power persists. The TurboQuant selloff was a gift. Server DRAM pricing is up 93-98% QoQ and accelerating. SK Hynix is raising up to $14 billion for capacity. The market is buying, not selling.
- The real risk is supply constraints, not demand destruction. The semiconductor supply chain is already struggling with converging bottlenecks. The demand side is a solved problem. The question is whether we can build fast enough.
- Something you might not know: Micron’s own data shows DDR5 is currently more profitable than HBM. If Engram-style architectures shift the memory mix from HBM toward DDR5, that’s not margin compression for memory manufacturers — it’s margin expansion on higher volume. The “HBM demand destruction” narrative isn’t just wrong about demand. It’s wrong about margins too.
We estimated in our Token Tsunami analysis that the world is serving 30-50 trillion tokens per day with capacity set to grow 10-20x by year-end. The Jevons mechanism explains why that capacity will get filled, not sit idle.
William Stanley Jevons figured this out in 1865. The AI market is learning the same lesson at $2.59 per GPU-hour.