DeepSeek V4: The Million-Token Paper That's Really About Memory
Reading the DeepSeek V4 technical report, one thing stood out: roughly a third of the paper is about memory. Here's why that makes complete sense — and why Huawei Ascend is part of the story.

I spent time this past weekend going through the DeepSeek V4 technical report. It covers two new MoE models - V4-Pro (1.6T parameters, 49B activated) and V4-Flash (284B parameters, 13B activated) - both supporting a one-million-token context window. The depth and detail of the report are impressive. The architecture is clever, and I learned a lot from it. But one thing kept nagging at me.
Why is so much of this paper about memory?
A frontier model release, in my mental model, should be about accuracy, capability, maybe latency. DeepSeek V4 has all of that. But an unusual amount of the paper's intellectual energy - roughly a third of the substantive content (~7,500 tokens out of ~23,000) - goes into compression, quantization, heterogeneous cache management, and on-disk storage strategies. It reads more like a systems paper written under real hardware pressure.
I had a hunch about why. Let me walk through it.
The Memory Work in DeepSeek V4
The architectural innovations in V4 are genuinely interesting, but almost all of them are pointed at the same target: shrinking the KV cache.
Compressed Sparse Attention (CSA) compresses the key-value cache along the sequence dimension, reducing it to roughly 1/m of its original entries, then runs sparse attention over a mix of compressed global entries and a small sliding window of recent uncompressed tokens. It's attention that forgets strategically.
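To make the shape of that concrete, here's a minimal sketch of what a cache built this way might look like, with mean pooling standing in for the learned compressor and made-up values for m and the window size - purely an illustration, not the paper's actual module:

```python
import numpy as np

def build_csa_cache(keys, values, m=8, window=128):
    """Split the KV cache into compressed global entries plus an uncompressed
    recent window. Mean pooling over blocks of m tokens stands in for CSA's
    learned compressor, which this sketch does not attempt to reproduce."""
    t = keys.shape[0]
    old_k, recent_k = keys[: t - window], keys[t - window:]
    old_v, recent_v = values[: t - window], values[t - window:]
    n_blocks = old_k.shape[0] // m
    comp_k = old_k[: n_blocks * m].reshape(n_blocks, m, -1).mean(axis=1)
    comp_v = old_v[: n_blocks * m].reshape(n_blocks, m, -1).mean(axis=1)
    return comp_k, comp_v, recent_k, recent_v

keys = np.random.randn(65_536, 128).astype(np.float32)
values = np.random.randn(65_536, 128).astype(np.float32)
ck, cv, rk, rv = build_csa_cache(keys, values)
print(ck.shape[0] + rk.shape[0])  # roughly 1/8 of the original 65,536 entries
```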
Heavily Compressed Attention (HCA) goes further - far more aggressive compression, but keeps dense attention rather than sparse. Think of it as: compress harder, but look at everything you kept. CSA and HCA are interleaved across transformer blocks, and together they're the reason V4-Pro at 1M tokens needs only 10% of the KV cache that V3.2 would require.
Sliding Window Attention (SWA) appears as an additional branch inside both CSA and HCA, and it's the one part that stays uncompressed - there to preserve local fine-grained dependencies. But since SWA entries exist at every layer and are never compressed, at scale they become the biggest storage problem in the whole system. Which is why the paper introduces three separate on-disk SWA caching strategies (full caching, periodic checkpointing, and zero-caching with recomputation), each offering a different tradeoff between storage overhead and compute.
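For intuition about how those three strategies trade storage against recompute, here's a toy cost model. The cost formulas and every parameter in it are my assumptions for illustration, not figures from the paper:

```python
# Toy cost model for the three SWA on-disk caching strategies named in the paper:
# full caching, periodic checkpointing, and zero-caching with recomputation.
# Formulas and parameters are assumptions for intuition, not the paper's numbers.
def swa_cache_costs(context_len, window, bytes_per_token, checkpoint_every):
    """Return (storage_bytes, tokens_recomputed_on_resume) for each strategy."""
    full = (context_len * bytes_per_token, 0)          # keep every SWA entry on disk
    # Periodic checkpointing: snapshot the window every `checkpoint_every` tokens,
    # then recompute at most `checkpoint_every` tokens forward when resuming.
    periodic = ((context_len // checkpoint_every) * window * bytes_per_token,
                checkpoint_every)
    zero = (0, context_len)                            # store nothing, recompute everything
    return {"full": full, "periodic": periodic, "zero": zero}

print(swa_cache_costs(context_len=1_000_000, window=4_096,
                      bytes_per_token=2_048, checkpoint_every=65_536))
```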
FP4 quantization-aware training is applied to the MoE expert weights and the indexer QK path, shaving memory further and enabling future hardware to potentially run these operations roughly a third more efficiently than FP8.
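As a rough illustration of the mechanic, here's a generic FP4 (E2M1) round-to-nearest sketch with a per-block scale. This is not DeepSeek's actual quantization-aware training recipe, just what squeezing a weight tile into 4-bit floats looks like:

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def quantize_fp4(x, block=32):
    """Generic per-block FP4 round-to-nearest; not the paper's QAT recipe."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]  # map block max to 6.0
    scale = np.where(scale == 0, 1.0, scale)
    mags = np.abs(x) / scale
    idx = np.abs(mags[..., None] - FP4_GRID).argmin(axis=-1)     # nearest representable code
    return np.sign(x) * FP4_GRID[idx] * scale                    # dequantized view

w = np.random.randn(4, 32).astype(np.float32)
print(np.abs(w - quantize_fp4(w)).mean())  # mean quantization error
```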
Manifold-Constrained Hyper-Connections (mHC) is a bit different - it's primarily about training stability, constraining residual mappings to doubly stochastic matrices to prevent the numerical instability that plagued vanilla Hyper-Connections at scale. But even here, a dedicated section (3.5.2) describes the "cost-effective and memory-efficient implementation of mHC" using recomputation and fused kernels, because the naive version would be too expensive to run.
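For reference, a doubly stochastic matrix has non-negative entries with every row and every column summing to 1. Sinkhorn normalization is one standard way to produce one; whether mHC enforces its constraint exactly this way is an assumption on my part, but the sketch shows the property the paper relies on:

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Alternately normalize rows and columns until the matrix is (near) doubly stochastic."""
    m = np.exp(logits)                      # positive entries
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True)   # rows sum to 1
        m /= m.sum(axis=0, keepdims=True)   # columns sum to 1
    return m

m = sinkhorn(np.random.randn(4, 4))
print(m.sum(axis=0), m.sum(axis=1))  # both approach [1, 1, 1, 1]
```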
TileLang, the domain-specific language they use to write all their fused kernels, cuts CPU-side invocation overhead from hundreds of microseconds to under one microsecond per call. Less directly about memory, but deeply connected - every microsecond of host overhead is time the accelerator is starved, which matters most when you're IO-bound.
And underneath all of it: a heterogeneous KV cache structure that manages three completely different types of cache entries (compressed CSA/HCA entries, SWA entries, and uncompressed tail tokens) under one roof, with optional spill to disk.
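Structurally, you can picture it as something like the container below. The field names, entry layout, and spill policy are illustrative guesses on my part, not the paper's implementation:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class HeterogeneousKVCache:
    """Illustrative container for the three entry types living under one roof."""
    compressed: list = field(default_factory=list)  # CSA/HCA compressed global entries
    swa: list = field(default_factory=list)         # uncompressed sliding-window entries
    tail: list = field(default_factory=list)        # recent uncompressed tail tokens

    def spill_swa_to_disk(self, path):
        """Move the largest component (SWA) out of HBM, freeing it for other sessions."""
        if self.swa:
            np.save(path, np.stack(self.swa))
            self.swa.clear()
```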
That's a lot of infrastructure for a next-generation foundation model. There must be a reason.
Bright side of the moon: Memory Is a P0 at Scale
Here's the thing that reframed this for me.
LLM inference - the token-by-token generation phase - is memory-bandwidth bound instead of compute-bound. When you generate a new token, the model has to load the entire KV cache from HBM into SRAM to attend over it. You're not limited by how fast you can multiply matrices. You're limited by how fast you can move data.
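A back-of-envelope number makes the bound tangible. If every generated token has to stream the whole cache, the per-token latency floor is just cache size divided by bandwidth; the figures below are illustrative, not measured:

```python
# Per-token latency floor = bytes streamed per decode step / HBM bandwidth.
kv_cache_gb = 100            # hypothetical cache size for one long-context session
hbm_bandwidth_gbps = 3350    # H100 SXM HBM3, roughly 3.35 TB/s
latency_floor_ms = kv_cache_gb / hbm_bandwidth_gbps * 1000
print(f"{latency_floor_ms:.0f} ms per token just to stream the cache")  # ~30 ms
```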
Now scale the context to one million tokens.
A rough estimate for a V3.2-scale model puts the KV cache at 400+ GB per user session at 1M tokens. That's at least five H100s' worth of HBM (80 GB each), for a single conversation. Forget about batching. Forget about serving multiple users. You simply cannot offer this as a product without compressing the hell out of that cache.
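The sizing math itself is simple. For a standard attention stack, it's keys plus values, times layers, times KV heads, times head dimension, times sequence length; the configuration below is a hypothetical stand-in, not V3.2's actual architecture, so the exact figure will differ:

```python
# Generic KV-cache sizing; parameters are a hypothetical config, not V3.2's.
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V

gb = kv_cache_bytes(seq_len=1_000_000, n_layers=60, n_kv_heads=8, head_dim=128) / 1e9
print(f"{gb:.0f} GB")  # ~246 GB here; larger configs push well past 400 GB
```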
This is the insight that reframed the whole paper for me. The memory work isn't a distraction from capability work - it is the capability work. Without it, a million-token context window is a demo. With it, you can actually run test-time scaling over long-horizon agent tasks, maintain reasoning across a full codebase, or build something like persistent memory into the model itself. The paper is direct about this: these architectural choices are what make "routinely supporting one-million-token contexts" possible in production.
So the memory obsession is justified on the merits, for any team working at this context length, on any hardware.
But there's one more layer to it.
Dark side of the moon: The Ascend Factor
DeepSeek validated their fine-grained expert parallelism scheme on — and I'm quoting directly from the paper — "both NVIDIA GPUs and HUAWEI Ascend NPUs platforms." That's not a throwaway line. That's a design target.
Here's why it matters. The US export controls on advanced semiconductors mean DeepSeek can't freely procure H100s, H200s, or GB200s. For training, they've been working with H800s - NVIDIA's export-compliant variant with NVLink bandwidth cut roughly in half compared to the H100. For inference and deployment inside China, Huawei's Ascend series is the strategic alternative.
The Ascend 910B, which has been the workhorse for Huawei's AI infrastructure, runs HBM2e and delivers approximately 36% of H100's memory bandwidth. Even the newer 910C, with 128GB of HBM3, benchmarks at roughly 60% of H100 inference performance overall. The upcoming Ascend 950 gets to about 80% of H100's memory bandwidth at ~3.2 TB/s - better, but still behind.
Memory bandwidth is the exact bottleneck that matters most in long-context decoding. A chip with 36–60% of H100's bandwidth doesn't just run slower - it's that much more sensitive to the size of the KV cache you're asking it to stream on every token generation. Every byte you compress out of the KV cache translates more directly into latency savings on lower-bandwidth hardware.
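To make that concrete, here's the same streaming cost at the bandwidth fractions quoted above, before and after a 10× cache reduction. Treating those percentages as pure bandwidth ratios and picking a 20 GB cache are both simplifications on my part:

```python
# Per-token streaming time for a 20 GB cache vs. a 10x-compressed 2 GB cache,
# at bandwidth fractions relative to an H100 (~3.35 TB/s).
h100_bw_gbps = 3350
for name, frac in [("H100", 1.00), ("Ascend 950", 0.80), ("910C", 0.60), ("910B", 0.36)]:
    full = 20 / (h100_bw_gbps * frac) * 1000    # ms to stream the uncompressed cache
    tenth = 2 / (h100_bw_gbps * frac) * 1000    # ms after 10x compression
    print(f"{name}: {full:.1f} ms -> {tenth:.1f} ms (saves {full - tenth:.1f} ms/token)")
```

The slower the chip, the more absolute milliseconds each token gets back from the same compression ratio.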
The same logic applies to DeepSeek V4's fine-grained wave-based expert parallelism. The scheme splits MoE experts into waves and overlaps computation, dispatch, and result-sending continuously. It achieves a 1.50–1.73× speedup on general inference and up to 1.96× on latency-sensitive RL rollouts. But its real purpose is to hide communication latency when your interconnect is not NVLink-class - which is exactly the situation on both H800s and Ascend.
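A toy timing model shows why the overlap matters more as communication gets slower relative to compute. Every number in it is made up for illustration:

```python
# Without overlap, each wave pays dispatch + compute + send in sequence.
# With overlap, communication for one wave hides behind compute of the next,
# so steady state is bounded by the slowest stage. Timings are invented.
def total_time_us(n_waves, t_dispatch, t_compute, t_send, overlapped):
    if not overlapped:
        return n_waves * (t_dispatch + t_compute + t_send)
    return t_dispatch + n_waves * max(t_dispatch, t_compute, t_send) + t_send

serial = total_time_us(8, t_dispatch=40, t_compute=60, t_send=40, overlapped=False)
overlap = total_time_us(8, t_dispatch=40, t_compute=60, t_send=40, overlapped=True)
print(serial / overlap)  # ~2x in this toy setting; the paper reports 1.50-1.96x
```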
And at the end of Section 3.1, DeepSeek publishes explicit advice to hardware vendors: balance computation-to-communication ratios rather than scaling bandwidth unconditionally; provide sufficient power headroom for fully concurrent workloads; design lower-latency cross-GPU signaling. That's not generic advice. That's a team that has been debugging these exact constraints on Ascend and H800 clusters writing notes to the people building the next chips.
So the full picture is this: the memory work would have been necessary at 1M context on any hardware. But the specific architecture of the constraints DeepSeek operates under - lower-bandwidth training chips, a strategic need to run efficiently on Ascend - turns "necessary" into "urgent and deeply optimized."
Two sides of the same planet. The physics of memory at million-token scale was always going to force this work. The hardware DeepSeek actually runs on made it urgent.
References
The paper itself
- DeepSeek V4 Technical Report — the full PDF on HuggingFace
- DeepSeek V4 Preview Release Notes — official API docs announcement
On KV cache as the million-token bottleneck
- Mastering LLM Techniques: Inference Optimization — NVIDIA's deep-dive on why decode is memory-bandwidth bound, not compute-bound
- KVQuant: Towards 10 Million Context Length LLM Inference — NeurIPS 2024 paper on KV cache quantization at extreme context lengths
- Optimizing Inference for Long Context with NVFP4 KV Cache — NVIDIA on FP4 KV cache and the memory math at scale
Huawei Ascend vs. NVIDIA chip comparisons
- Huawei Ascend 910C vs. NVIDIA H100 — spec-by-spec breakdown including HBM and bandwidth
- Huawei Ascend 950 Unveiled with In-House HBM — TrendForce on the 950's specs and roadmap
- NVIDIA vs. Huawei AI Chip Capabilities Comparison — MUFG chart comparing compute, memory, and bandwidth across generations
- DeepSeek V4 with Huawei Ascend Support — gHacks on the day-0 Ascend deployment
On export controls and DeepSeek's hardware constraints
- DeepSeek and the Effects of GPU Export Controls — clear breakdown of what H800 restrictions mean in practice
- China's AI Chip Race: Tech Giants Challenge Nvidia — IEEE Spectrum on the broader domestic chip ecosystem
- TileLang-Ascend — the open-source Ascend adapter for the same DSL used in V4's kernel stack

