DeepSeek-V4 Explained: The End of Standard Attention in LLMs?

In this post, we break down DeepSeek-V4, a new large language model architecture from DeepSeek designed for highly efficient reasoning over million-token contexts.

DeepSeek-V4 paper title: DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
DeepSeek-V4 paper title (Source)

Why Long-Context Reasoning Is Becoming Critical for AI Models

Agentic workflows are rapidly scaling up, and as they do, they require models to reason over longer and longer contexts. At the same time, the rise of reasoning models and test-time scaling has pushed models to spend more computation verifying and refining their own outputs.

As a result, efficiently handling extremely long contexts is becoming more important than ever. However, there’s still a major bottleneck. The standard attention mechanism scales quadratically with sequence length, making ultra-long context reasoning incredibly expensive.

Introducing DeepSeek-V4

DeepSeek’s new paper tackles this problem head-on. It is titled DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence.

If million-token context becomes practical, it could fundamentally change what large language models can do. The key idea to keep in mind while we go through the architecture is that instead of attending to every token in the sequence, the model compresses most of the context and focuses computation where it matters.

DeepSeek-V4 Efficiency Improvements

DeepSeek-V4 Key Results
DeepSeek-V4 Key Results (Source)

In the above figure on the right, we can see massive efficiency gains. The bottom chart tracks the memory required to store the Key-Value (KV) cache as the sequence length increases. At a one-million-token context length, the key-value cache for DeepSeek-V4-Pro is 9.5 times smaller than for DeepSeek-V3.2. So, we still need to load the model’s weights, but the inference memory footprint is dramatically reduced.

The upper chart shows the computational effort required to generate a single token at various positions in a sequence. Here as well, we see dramatic savings at the one-million-token mark.

Moving to the chart on the left, we can see that DeepSeek-V4 is competitive with top proprietary models. The authors estimate that it trails frontier models by approximately 3 to 6 months.

DeepSeek-V4 High-Level Architecture

DeepSeek-V4 High-Level Architecture
DeepSeek-V4 High-Level Architecture (Source)

At the bottom of the figure above, we have the input token embeddings. From there, the input flows through multiple pathways. One pathway processes the input through the main Transformer blocks.

The first major module is the attention block, but unlike standard Transformers, DeepSeek-V4 replaces full attention with specialized attention mechanisms. One is CSA, short for Compressed Sparse Attention, and the other is HCA, short for Heavily Compressed Attention. Different layers alternate between these two attention mechanisms, and we’ll dive into how they work in a moment.

The second major module is DeepSeekMoE, which contains the mixture-of-experts feed-forward layers.

Manifold-Constrained Hyper-Connections (mHC) in DeepSeek-V4

Another pathway is the residual stream on the left, which passes the input directly between components and layers. Now, unlike standard residual connections, you may notice that the residual stream contains multiple embedding vectors instead of just one.

This comes from the use of mHC, short for Manifold-Constrained Hyper-Connections, which we covered in a dedicated post a few months ago. This new type of residual connections is a key ingredient in DeepSeek-V4. In short, it upgrades the residual stream to have much more expressive power comparing to the standard residual connections.

The input is expanded to a larger dimension before passing through the residual stream. However, processing this larger representation directly inside the Transformer blocks would dramatically increase computation costs. So before entering the attention or mixture-of-experts modules, the representation is reduced back to its original size. And after each module finishes processing, the representation is expanded again into the larger residual stream.

At the end, after a stack of the Transformer layers, the output is generated using the same prediction head and multi token prediction modules that were used in DeepSeek-V3.

Most of the efficiency gains in DeepSeek-V4 come from the new attention mechanisms, so let’s now dive deeper into how those work.

Heavily Compressed Attention (HCA) Explained

Heavily Compressed Attention (HCA) High-Level Architecture
Heavily Compressed Attention (HCA) High-Level Architecture (Source)

In the above figure, we can see the high-level architecture of this attention mechanism. At the bottom, we have the hidden state of a query token on the right, and the entire sequence of KV tokens on the left, representing both the keys and the values.

In standard attention, the query token attends to all other tokens in the sequence. This is what would happen if we would immediately send these raw inputs straight to the attention component at the top. But, we actually pass a much smaller sequence of hidden states to the attention block.

How do we get this smaller sequence?

HCA’s Compression Mechanism – A Global Sequence Summary

The first part comes from a component called the Token-Level Compressor. This module compresses entire groups of token hidden states into single entries. More specifically, each group of 128 tokens is replaced by just a single token.

This single entry has to capture all the meaningful content from the tokens group it replaces. To do that, the compressor learns to extract the most critical information. This is done using dedicated weights and a softmax operation, which assigns an importance level to each token in the group.

Adding Local Context to HCA’s Final Sequence

This heavy compression provides a highly efficient global summary of the sequence, but it naturally loses a lot of local, fine-grained context. And since the most recent tokens are usually the most important for predicting the next one, the model compensates for this.

The compressed sequence is concatenated with the last immediate tokens right before the query token. These are the sliding window tokens which we can see on the left.

Finally, the actual attention is calculated between this much smaller sequence and the query token. Now that we see how the model gets its global summary, let’s move on to understand the second type of attention used here, called Compressed Sparse Attention.

Compressed Sparse Attention (CSA) Explained

Compressed Sparse Attention (CSA) High-Level Architecture
Compressed Sparse Attention (CSA) High-Level Architecture (Source)

We can see the high-level architecture of this attention mechanism in the above figure from the paper. Just like before, at the bottom we have the hidden state of our query token on the right, and our entire sequence of KV tokens on the left, which represent both the keys and values. Here again, the sequence is dramatically compressed before it goes through the attention block at the top. But the compression mechanism is different.

CSA’s Compression Mechanism – Keep The Most Relevant Sections

While HCA squeezed 128 tokens down to 1, the first stage here is much gentler, and it only compresses 4 tokens into a single entry. The mechanism itself is similar, using learnable weights and a softmax operation to filter out the noise and keep the most important information.

One difference here though is that the compression process uses overlapping windows. It looks at 8 tokens, the current 4 and the previous 4, so each token is considered for two entries in the compressed sequence.

After this phase, we’ve cut the sequence size by 4. But if we are processing a 1-million token prompt, we’re left with a quarter-million tokens which is still a very long sequence.

CSA’s Compression Leverages DeepSeek Sparse Attention (DSA)

To reduce it further, the compressed sequence passes into another component called the Top-k Selector, which selects only the absolute most important k entries out of the compressed sequence, while the rest are filtered out. But how does it know which entries are the most important? We can see it gets index scores from a module called the Lightning Indexer. This comes from DeepSeek Sparse Attention (DSA) which was introduced in the previous version, DeepSeek-V3.2.

To generate these scores efficiently, the indexer looks at a low-resolution version of the compressed sequence. It is compressed using the same mechanism, but to a lower dimension to save compute. The query token is also converted to this smaller dimension. Then, attention is used to score the importance of the different entries in the compressed sequence based on the current query.

CSA’s Final Compressed Sequence

After the Top-k selector, the size of the newly compressed sequence is now k, which is set to 1024 in the pro version, and 512 in the flash version.

And finally, just like we saw with the previous attention mechanism, this highly-filtered sequence is concatenated with the last 128 uncompressed tokens to preserve immediate local context.

So, even for extremely large sequences, the attention block still only needs to process a sequence of size of 1152 tokens. Compared to HCA, CSA acts like a more selective attention mechanism. Instead of compressing aggressively, it tries to preserve the most important regions of the sequence.

How DeepSeek-V4 Prevents Information Loss

Going back to the high-level architecture for a moment, we mentioned that these two types of attention are interleaved. One immediate concern is loss of data between layers. In other words, whether we could lose crucial information that was not important enough during compression in a specific layer.

What protects the model against such a concern is the residual stream, which let the uncompressed information propagate forward to subsequent layers.

DeepSeek-V4 Results

Scaling Up Test-Time Scaling

DeepSeek-V4 Max Mode's Performance Boost
DeepSeek-V4 Max Mode’s Performance Boost (Source)

Because the model handles long contexts so efficiently, DeepSeek realized they could scale up test-time compute further than before. They did that by reducing the penalty for generating long sequences during the reinforcement learning stage. This version of the model is called the Max mode.

In the above figure, we can see the benefit of that. By scaling test-time compute, the DeepSeek-V4 series achieves substantial improvements over its predecessor. We can clearly notice the jump in performance by moving from High mode to Max mode.

DeepSeek-V4 Benchmark Results

DeepSeek-V4 Benchmark Results
DeepSeek-V4 Benchmark Results (Source)

The above table from the paper shows performance comparison of DeepSeekV4 against top tier closed-source and open-source models across multiple benchmarks. DeepSeekV4-Pro Max is overall competitive with top-tier propriety models, and outperforms top open-source models on almost all benchmarks, including a staggering 20 point gap in the knowledge benchmark SimpleQA-Verified.

Long-Context Retrieval Results

Long-Context Retrieval Results
Long-Context Retrieval Results (Source)

Finally, we’ve talked a lot about the efficiency of processing 1 million tokens, but does the model actually remember what it reads? The above figure answers that by showing long-context retrieval results. Specifically, it shows the Multi-Needle Retrieval accuracy graph, scaled all the way up to the 1-million-token context window.

In this test, the model is challenged to successfully find and extract multiple specific facts hidden deep within a massive document.

We can see that up to 128k the model keeps stable performance. Afterwards it starts to degrade, but still keeps an impressive performance even on 1 million-token sequences.

References & Links

All credit for the research goes to the researchers who wrote the paper we covered in this post.

DeepSeek-V4 Teaser
DeepSeek-V4 Teaser
Scroll to Top