Musk's retweet of Kimi's paper sparked a heated debate in Silicon Valley: what is the next battlefield for attention?

Wallstreetcn
2026.03.20 13:36

Elon Musk retweeted the Kimi team's paper "Attention Residuals," sparking heated discussion in Silicon Valley, with Andrej Karpathy and OpenAI's Jerry Tworek weighing in. Meanwhile, the ByteDance Seed team, together with Huazhong University of Science and Technology, released a related paper, "Mixture-of-Depths Attention," and a paper by Nanjing University and others, "When Does Sparsity Mitigate the Curse of Depth in LLMs," appeared the same week. All three papers target structural issues in the attention mechanism, marking significant progress in the field.

On March 16, 2026, the Kimi team uploaded a paper titled "Attention Residuals" to arXiv, and the discussion quickly took on a life of its own. Musk retweeted it; Karpathy commented, "We haven't really taken the title 'Attention is All You Need' seriously"; and OpenAI's Jerry Tworek summed it up in a single phrase: deep learning 2.0. A structural paper from a Chinese team stirring this level of discussion in Silicon Valley hasn't happened since DeepSeek-V3.

But amidst the excitement, most discussions remained at the level of "Kimi has created something new, and the big shots are excited." What was overlooked is that on the same day, the ByteDance Seed team and Huazhong University of Science and Technology jointly published another paper called "Mixture-of-Depths Attention (MoDA)," addressing the exact same problem but using a completely different approach. In the same week, a third paper by Nanjing University's Dilxat Muhtar, MPI's Shiwei Liu, and others titled "When Does Sparsity Mitigate the Curse of Depth in LLMs" provided the most precise pathological report from a theoretical perspective.

Three papers emerged in quick succession, targeting the same goal. This is not a coincidence. A structural problem that has been ignored for nearly a decade has finally reached a critical point that must be addressed.

The issue does not lie in the sequence dimension of attention. Attention has gone through many generations over the past few years: from multi-head attention to grouped-query attention, to DeepSeek's MLA, to various sparse variants, each generation optimizing how tokens look at each other. This arms race is fascinating, but it obscures a fact: the way information is passed between layers has not changed since the Transformer paper in 2017. Residual connections: h = h + f(h), an addition with no learnable parameters.

The outputs of all historical layers are summed equally. There is no selection, no forgetting, no learning. The contributions of each layer are indiscriminately piled into the residual stream, regardless of whether they learned key features or noise.

Residual connections are the most successful "temporary solution" in the history of deep learning.

The Most Successful Temporary Solution

Residual connections were proposed by Kaiming He in ResNet in 2015. The idea is extremely simple: once a network is stacked past twenty-odd layers, vanishing gradients make it untrainable, and the deep parameters barely update. So a "highway" is added at each layer, letting the input skip that layer and connect directly to the output. Even if the layer learns nothing, information and gradients can still flow through the shortcut. The effect was immediate, pushing ResNet from twenty-odd layers to more than a hundred. Two years later, the Transformer was born, and the residual connection was carried over intact. The design has not changed since.

Not that no one has tried. Variants such as ReZero, FixUp, and Highway Networks made residual weights learnable. None of them made it into mainstream large-model architectures, because residual connections are too useful: simple, stable, and nearly free computationally. At the model scales of the time, the side effects had not yet surfaced.

44% of layers are idling

What are the side effects? In early 2025, a team led by Shiwei Liu spanning Westlake University, Emory, and MPI published "The Curse of Depth," and in March of this year, Dilxat Muhtar and others from Nanjing University provided further quantitative diagnostics in "When Does Sparsity Mitigate the Curse of Depth in LLMs." Under today's mainstream large-model architectures, the transformations computed by deeper layers increasingly approach the identity mapping: whatever goes in comes out unchanged, so the layer is effectively absent.

The numbers are grim. The researchers use "utility scores" to measure whether each layer performs a meaningful transformation. In a 12-layer model, every layer is working. In a 16-layer model, three layers are wasted. In a 24-layer model, nine. In a 32-layer model, 14 layers, 44% of the total, learn almost nothing. The parameter count grew from 900 million to 2.3 billion, a 156% larger budget, while the effective layer count rose only from 12 to 18.
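The percentages above follow directly from the layer counts quoted in the papers; as a sanity check, here is the arithmetic spelled out (the layer and parameter figures are taken from the article, everything else is recomputed):

```python
# Back-of-the-envelope check of the "curse of depth" figures quoted above.
# Keys are total layer counts; values are idle-layer counts from the article.
configs = {
    12: 0,   # all 12 layers working
    16: 3,   # 3 layers wasted
    24: 9,
    32: 14,
}

for total, wasted in configs.items():
    idle_frac = wasted / total
    print(f"{total} layers: {wasted} idle ({idle_frac:.0%}), "
          f"{total - wasted} effective")

# Parameter budget vs. effective depth: 0.9B -> 2.3B parameters is a
# (2.3 - 0.9) / 0.9 = 156% increase, yet effective layers only go 12 -> 18.
growth = (2.3 - 0.9) / 0.9
print(f"parameter growth: {growth:.0%}")
```

Running this confirms the 32-layer case: 14/32 rounds to exactly the 44% idle figure cited in the text.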

Quantitative diagnosis of the curse of depth—efficiency decay of effective layers as model size increases

The reason is directly related to how residual connections work. The output of each layer is added to a "main road" through residual connections. As the number of layers increases, the accumulated signal on the main road becomes larger (which can be understood as the "background volume" continuously rising), but the amplitude of the new signal generated by each layer is limited. By the time it reaches deeper layers, the new signals are drowned in background noise, making the input and output almost identical; this layer is effectively useless.
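The "background volume" effect can be made concrete with a toy simulation of my own (not from either paper): if each layer writes a roughly independent unit-norm update into the residual stream, the stream's norm grows like the square root of the depth, so the relative share of any single new update shrinks accordingly.

```python
import math
import random

random.seed(0)
dim = 256

def unit_vec():
    """A random direction in dim-dimensional space, scaled to norm 1."""
    v = [random.gauss(0, 1) for _ in range(dim)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

stream = [0.0] * dim          # the residual "main road"
for layer in range(1, 65):
    update = unit_vec()       # this layer's contribution, always norm 1
    stream = [s + u for s, u in zip(stream, update)]
    if layer in (1, 8, 64):
        stream_norm = math.sqrt(sum(x * x for x in stream))
        # each update has norm 1, so its relative share is 1 / stream_norm
        print(f"layer {layer:3d}: stream norm {stream_norm:5.2f}, "
              f"new-signal share {1 / stream_norm:.2f}")
```

By layer 64 the stream's norm is roughly 8 (≈ √64), so a fresh layer's signal carries only about an eighth of the background's weight, which is the dilution the text describes.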

Residual connections solved the problem of passing gradients, but created a new one: keeping deeper layers meaningful.

In the era of large models, this cost translates to real money. One layer represents tens of billions of floating-point operations. If a 128-layer model has 44% of its layers idling, nearly sixty layers' computational power is doing useless work. The community has spent years optimizing inference efficiency, with quantization, distillation, pruning, sparse attention, and KV cache compression, all aimed at optimizing those "useful computations."

The biggest efficiency black hole is not in the quadratic complexity of attention, but in an addition operation that hasn't changed since 2015.

Adding a depth dimension to attention

The ByteDance Seed team's MoDA chose a different path. It does not alter the residual connections; instead, it adds a second dimension to the attention mechanism itself. A standard Transformer runs attention only along the sequence dimension: each token in the current layer looks at the KVs of other tokens in the same layer. MoDA's modification is intuitive: the KVs of historical layers join the attention candidate set. When a token computes attention at layer L, it can see not only other tokens in the same layer but also refer directly back to the KVs of layers 1 through L-1. The sequence dimension and the depth dimension are jointly normalized under the same softmax.
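The joint normalization can be sketched in a few lines. This is my own simplification, not the paper's implementation: a single query scores same-layer token KVs and KVs cached from earlier layers together, and one softmax normalizes across both dimensions.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moda_attend(query, seq_kv, depth_kv):
    """seq_kv: [(key, value)] from the current layer's tokens;
    depth_kv: [(key, value)] gathered from selected historical layers."""
    candidates = seq_kv + depth_kv            # joint candidate set
    scores = [sum(q * k for q, k in zip(query, key)) for key, _ in candidates]
    weights = softmax(scores)                 # ONE softmax over both dims
    dim = len(query)
    out = [0.0] * dim
    for w, (_, value) in zip(weights, candidates):
        out = [o + w * v for o, v in zip(out, value)]
    return out, weights

q = [1.0, 0.0]
seq = [([1.0, 0.0], [1.0, 1.0]), ([0.0, 1.0], [2.0, 2.0])]
deep = [([1.0, 0.0], [3.0, 3.0])]             # a KV recalled from an earlier layer
out, w = moda_attend(q, seq, deep)
print(out, sum(w))                            # weights across both dims sum to 1
```

The point of the single softmax is that a depth-dimension candidate competes for weight on equal footing with same-layer tokens, rather than being merged through a separate, fixed mixing step.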

The idea is easy to understand, but the challenge lies in how to implement it without compromising speed.

MoDA's dual-dimensional attention mechanism—joint normalization of sequence dimension and depth dimension under the same Softmax.

Stuffing all historical layer KVs into attention would lead to an explosion in computation. In a 32-layer model, the 32nd layer needs to look at all KVs from the previous 31 layers, effectively increasing the sequence length by 32 times. The engineering core of MoDA is a "group reordering" strategy that selects only a portion of historical layer KVs, rearranging them into contiguous memory to allow efficient execution of matrix multiplication on the GPU.

Specifically, MoDA introduces a "depth flow" mechanism. Not every layer looks at all historical layers; instead, it selects the most relevant layers through a learnable routing. This is similar to the Mixture-of-Experts approach—not activating all experts but dynamically selecting the needed ones. The difference here is that the "experts" are historical layers of different depths.
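The routing step can be sketched as a top-k selection, in the MoE spirit the paragraph describes. The scoring function and names below are illustrative assumptions of mine, not the paper's actual "depth flow" routing.

```python
def route_depth(query_summary, layer_summaries, k=2):
    """Pick the k historical layers whose summaries best match the query.
    In a trained model the scoring would be learnable; a dot product
    stands in for it here."""
    scores = [
        (sum(q * s for q, s in zip(query_summary, summary)), idx)
        for idx, summary in enumerate(layer_summaries)
    ]
    scores.sort(reverse=True)                      # highest score first
    return sorted(idx for _, idx in scores[:k])    # indices of chosen layers

# Summaries for historical layers 0..3; the query most resembles layers 0 and 3.
summaries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.9, 0.1]]
picked = route_depth([1.0, 0.0], summaries, k=2)
print(picked)  # → [0, 3]
```

Only the KVs of the selected layers then enter the joint attention, which is what keeps the candidate set from growing linearly with depth.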

At a sequence length of 64K, MoDA's operator efficiency reached 97.3% of FlashAttention-2. With the entire depth attention mechanism added, the speed only slowed down by less than 3%.

Group reordering strategy—moving historical layer KVs scattered across memory to contiguous memory areas.

On a model with 1.5 billion parameters (based on the OLMo2 training recipe), MoDA improved average performance by 2.11% across 10 downstream tasks, with an additional computational overhead of only 3.7%. At first glance, this may seem small, but it is an architectural improvement, not achieved through more data or longer training. Moreover, the effectiveness of MoDA increases with model size—deeper degradation becomes more severe in larger models, making MoDA's corrective effect more pronounced.

Performance comparison of MoDA on 10 downstream tasks

The interplay between MoDA and Post-Norm is worth noting. Mainstream large models almost universally use Pre-Norm (normalization before attention), because Post-Norm (normalization after attention), though theoretically superior, is unstable during training. MoDA's depth-KV mechanism provides an extra gradient channel for Post-Norm, alleviating its original instability.

The combination of MoDA and Post-Norm raises the possibility that compromises once made for training stability (adopting Pre-Norm) can now be rolled back.

Difference in validation loss between Pre-Norm and Post-Norm after adding deep KV

Not a new path, but renovating the old one

MoDA does not alter the residual connection; it chooses to open another path outside of the residual. On the same day, the Kimi team released Attention Residuals (AttnRes), which took a more direct route by directly modifying the residual connection itself.

The standard residual connection simply sums the outputs of all previous layers equally and stacks them into the main path. There is no choice, no forgetting. AttnRes replaces this fixed equal-weight addition with an attention operation, where each layer uses its own state as a query and the outputs of all previous layers as candidates, using attention to determine which features from previous layers are useful for the current layer and their respective weights.
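A minimal sketch of that idea, in my own simplification rather than Kimi's implementation: instead of h = h + f(h), the layer forms its update by attending over all previous layers' outputs, with its own state as the query, so the residual becomes a learned, input-dependent mix.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attn_residual(state, history):
    """history: outputs f_1(h), ..., f_{L-1}(h) of all previous layers."""
    # the layer's own state acts as the query over past layer outputs
    scores = [sum(s * h for s, h in zip(state, out)) for out in history]
    weights = softmax(scores)                 # replaces fixed equal weights
    dim = len(state)
    mixed = [0.0] * dim
    for w, out in zip(weights, history):
        mixed = [m + w * o for m, o in zip(mixed, out)]
    # new state = current state + attention-weighted history, not a raw sum
    return [s + m for s, m in zip(state, mixed)], weights

state = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
new_state, w = attn_residual(state, history)
print(new_state, [round(x, 3) for x in w])
```

The contrast with the standard residual is in the `weights` line: equal-weight addition corresponds to fixing every weight at 1, whereas here each previous layer's contribution is rescaled per input.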

The residual connection transforms from a fixed formula into a learnable dynamic routing.

The core idea of AttnRes—using attention to replace equal-weight residual addition

The cost is that each layer must perform an additional depth-attention computation, which is not trivial. The Kimi team controls it with a block strategy (Block AttnRes): layers are divided into blocks, full depth attention runs within each block, and across blocks only block-level summary representations are attended to.
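A rough cost model shows why blocking helps. The counting below is my own back-of-the-envelope sketch, not the paper's analysis: full depth attention lets layer i attend to all i-1 predecessors, while the blocked variant attends fully only within its block and sees one summary per earlier block.

```python
def full_cost(n_layers):
    """Total depth-attention candidates when layer i sees all i-1 predecessors."""
    return sum(i - 1 for i in range(1, n_layers + 1))

def block_cost(n_layers, block):
    """Candidates when attention is full within a block, summarized across blocks."""
    total = 0
    for i in range(1, n_layers + 1):
        within = (i - 1) % block              # predecessors inside own block
        earlier_blocks = (i - 1) // block     # one summary per earlier block
        total += within + earlier_blocks
    return total

L = 48
print(full_cost(L), block_cost(L, block=8))   # → 1128 288
```

Under this toy count, a 48-layer model with blocks of 8 attends to about a quarter as many depth candidates, which is the kind of saving that makes the mechanism affordable.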

AttnRes has already been integrated into Kimi Linear (48 billion total parameters / 3 billion activated), pre-trained on 1.4 trillion tokens, with consistent results across model sizes. The paper has been widely covered, so the technical details will not be repeated here. It is worth mentioning for its contrast with the MoDA route.

The training curves and ablation experiments of AttnRes

Both routes diagnose exactly the same cause: the shallow-layer information that deep layers receive has been repeatedly diluted by residual updates. But the remedies differ. MoDA leaves the residual connections untouched and adds a depth dimension to attention, letting deep layers bypass the residual stream and access shallow layers' original features directly. AttnRes goes straight at the residual connection, replacing equal-weight addition with attention weighting. One builds a new path; the other renovates the original.

The two papers appeared on the same day, with different routes but the same target. This is not a coincidence. The depth issue of attention has become a consensus in the research community, with the only difference being the direction of approach.

The consistency of AttnRes's effectiveness across different model scales

The scaffolding no one remembered to take down

Returning to the initial question, why has the issue of deep layers idling only been seriously addressed in 2026?

Because residual connections are too useful. They solved one of the most pressing problems at the time (gradient vanishing), with manageable costs (deep degradation is not obvious in small models), and alternative solutions were immature (ReZero and Highway Network had not undergone large-scale validation). No one had the motivation to change it. It was not an intentionally retained design choice but a forgotten temporary solution. The scaffolding built initially was left up after the building was completed, and over time, everyone thought it was a load-bearing wall.

The signal dilution effect of residual connections— the deeper the layers, the harder it is for new signals to be heard

However, what truly made this issue difficult to discover is not the residual connections themselves, but the fact that the attention mechanism has long operated in only one dimension. Over the past eight years, all evolutions of attention—multi-head, grouped queries, sparse, linear—have focused on the sequence dimension. The way tokens look at each other has been optimized countless times. But how do layers look at each other? This question has never been asked. The depth dimension is a blind spot for attention.

MoDA and AttnRes have opened up this blind spot from different directions. MoDA adds a second dimension to attention, allowing it to operate simultaneously in both the sequence and depth directions. AttnRes transforms the inter-layer information transfer itself into an attention operation. The routes are different, but they both point to the same conclusion: attention should not only look horizontally; it should also look vertically.

The implications of this conclusion are larger than the two papers themselves. Many fixed, single-dimension mechanisms remain in the Transformer. Every layer must execute in order and cannot be skipped. Every attention head computes independently and is simply concatenated, with no dynamic coordination between heads. Every token, regardless of difficulty, follows exactly the same computational path. These designs were originally compromises made so that models could train and converge.

The evolution of deep learning over the past decade, abstracted to its highest level, is about one thing: handing more and more structural decisions back from human designers to the model itself. Handcrafted convolutional kernels gave way to learnable attention. Fixed absolute positional encodings gave way to rotary encodings. Fixed expert assignment gave way to learnable routing. Now the way information flows along the depth dimension is beginning to be decided by attention itself.

Karpathy said we have not yet taken the literal meaning of "Attention is All You Need" seriously. He may be right. But not in the sense of "attention is enough," but rather "attention has not been fully utilized." It has evolved many generations in the sequence dimension, but has only just begun in the deep dimension.

Depth is the next battlefield for attention.

Source: Tencent Technology

Risk Warning and Disclaimer

Markets carry risk; invest with caution. This article does not constitute personal investment advice and does not take into account individual users' specific investment objectives, financial situations, or needs. Users should consider whether any opinion, view, or conclusion herein suits their particular circumstances. Investing on this basis is at one's own risk.