Moonshot AI proposes new method for sharing information between LLM layers, claims 1.25x compute advantage
Written by Veronica Salvador and Joseph Nordqvist · March 16, 2026 at 5:21 AM UTC
5 min read
Moonshot AI's Kimi Team has published a technical report proposing a change to a fundamental component of modern AI models called residual connections.[1]
The new method, called Attention Residuals, lets each layer in a neural network selectively choose which earlier layers to draw from, rather than receiving a fixed mixture of everything that came before.
Based on scaling law experiments, the team says the method matches the performance of a standard model trained with 1.25x more compute, while adding less than 2% inference latency.
The method was tested on Moonshot's Kimi Linear model (48B total parameters, 3B activated).[2]
What problem does this solve?
Every modern LLM is built from dozens or hundreds of stacked layers, each processing and refining the model's understanding of an input. Information passes between layers through residual connections, a technique introduced in 2015 that simply adds each layer's output to a running total carried forward.[3]
This running total treats every layer's contribution equally. A layer near the end of the network receives the same blend of all previous layers regardless of what the input is or what information would actually be useful. As models get deeper, this equal weighting causes a dilution problem: each individual layer's contribution gets increasingly buried in the growing sum.
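The standard mechanism is simple enough to sketch in a few lines. The toy "layers" below are random linear maps standing in for real transformer blocks; the point is only the fixed, unweighted accumulation over depth that the article describes:

```python
import numpy as np

def standard_residual_stream(x, layers):
    """Standard residual stream: each layer's output is simply added to
    a running sum, so every layer's contribution is weighted equally."""
    h = x
    for layer in layers:
        h = h + layer(h)   # fixed, unweighted accumulation over depth
    return h

# Toy stand-in layers: small random linear maps, not real transformer blocks.
rng = np.random.default_rng(0)
d = 4
layers = [lambda h, W=rng.normal(scale=0.1, size=(d, d)): h @ W
          for _ in range(6)]
out = standard_residual_stream(np.ones(d), layers)
```

With six layers, the final state is a sum of seven terms (the embedding plus six layer outputs), and nothing downstream can re-weight or separate them: that is the dilution problem.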
Separately, prior research has shown that a significant fraction of layers in deep models can be pruned with minimal performance loss, and that early-layer information cannot be selectively recovered once buried in the accumulated residual stream.[4]
What the Kimi Team changed
The team's fix draws on a parallel between two problems in AI architecture. RNNs, an older type of neural network, compressed all prior information into a single state when processing sequences, making it impossible to selectively retrieve earlier inputs. Transformers solved that problem by introducing attention, which lets the model selectively focus on the most relevant parts of a sequence.
The team argues that residual connections have the same structural limitation over depth: they compress all prior layer outputs into a single running sum with no way to selectively retrieve individual contributions. Attention Residuals applies the same fix, but across layers instead of across words.
Instead of blindly summing everything, each layer now uses a lightweight attention mechanism to decide how much weight to give each preceding layer's output. The added cost per layer is minimal: one small learned vector and one normalization operation.
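Based on the report's description (learned softmax attention over depth), the mechanism might look roughly like the sketch below. This is an illustrative reconstruction, not Moonshot's implementation: the per-layer query vectors, the scoring rule, and the toy linear layers are all assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def attn_residual_stream(x, layers, queries):
    """Hypothetical sketch of attention over depth: before each layer, a
    small learned query scores every preceding output, and the layer's
    input is a softmax-weighted mix rather than an unweighted sum."""
    outputs = [x]                                     # outputs[0] is the embedding
    for layer, q in zip(layers, queries):
        scores = np.array([o @ q for o in outputs])   # one score per prior output
        weights = softmax(scores)                     # selective, learned mixture
        h = sum(w * o for w, o in zip(weights, outputs))
        outputs.append(layer(h))
    return outputs[-1]

# Toy stand-in layers and queries (made-up values for illustration).
rng = np.random.default_rng(1)
d = 4
layers = [lambda h, W=rng.normal(scale=0.1, size=(d, d)): h @ W
          for _ in range(6)]
queries = [rng.normal(size=d) for _ in range(6)]
out = attn_residual_stream(np.ones(d), layers, queries)
```

The contrast with the standard stream is that the mixing weights depend on the input and are learned, so a late layer can emphasize one early layer's output and suppress the rest.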
To make this practical at scale, the team also introduced Block Attention Residuals. This groups layers into blocks and applies the selective attention across block-level summaries rather than individual layers, keeping overhead low enough for production training. The team reports that grouping into roughly 8 blocks captures most of the benefit.
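The block-level idea reduces how many items the depth attention has to score. A minimal sketch, assuming block summaries are simple means of the per-layer outputs (the actual summary function in the report may differ):

```python
import numpy as np

def block_summaries(outputs, num_blocks):
    """Hypothetical sketch of the block-level variant: group per-layer
    outputs into contiguous blocks and keep one summary (here, the mean)
    per block, so depth attention scores num_blocks items instead of
    one per layer."""
    stacked = np.stack(outputs)                    # (num_layers, d)
    blocks = np.array_split(stacked, num_blocks)   # contiguous groups of layers
    return [b.mean(axis=0) for b in blocks]

# 48 toy layer outputs of dimension 4, grouped into the ~8 blocks the
# team reports as sufficient.
outputs = [np.full(4, float(i)) for i in range(48)]
summaries = block_summaries(outputs, num_blocks=8)
```

With 48 layers and 8 blocks, each attention step compares 8 summaries rather than up to 48 individual outputs, which is where the overhead saving comes from.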
Results
The team tested Attention Residuals across five model sizes, using identical training settings to ensure fair comparison. At every scale, the new method achieved lower loss than the standard approach.
At the largest scale, fitting standard power-law curves to the results showed a 1.25x compute advantage: the new method reached the same validation loss as a baseline that used 25% more training compute. Training overhead under pipeline parallelism was less than 4%, and inference latency increased by less than 2%.
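The "compute advantage" arithmetic works as follows: if the baseline's validation loss follows a fitted power law L(C) = a * C^(-b) in training compute C, you can solve for how much baseline compute would be needed to match the new method's loss. The constants below are made-up values for illustration, not the paper's fit:

```python
def compute_multiplier(a, b, C, L_new):
    """Given a baseline power-law fit L(C) = a * C**(-b), return how much
    more baseline compute is needed to match loss L_new, which the new
    method achieved at compute C. (Toy illustration; a, b, C, L_new are
    assumed values, not figures from the report.)"""
    C_eq = (a / L_new) ** (1.0 / b)   # solve a * C_eq**(-b) = L_new
    return C_eq / C

# Example: with an assumed fit a=10, b=0.1, a method reaching at C=1000
# the loss the baseline would reach at C=1250 has a 1.25x advantage.
m = compute_multiplier(a=10.0, b=0.1, C=1000.0, L_new=10.0 * 1250.0 ** -0.1)
```

A 1.25x multiplier therefore means the baseline would need 25% more training compute to reach the same loss, which is how the team frames its headline claim.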
On a full-scale 48B-parameter model, the team reports improvements across all benchmarks tested. The gains were largest on tasks requiring multi-step reasoning: GPQA-Diamond (a graduate-level science benchmark) improved by 7.5 points, math reasoning by 3.6 points, and code generation on HumanEval by 3.1 points. General knowledge and language understanding benchmarks also improved, though by smaller margins.
All results are self-reported by the Kimi Team. Independent replication has not been published.
Context
Attention Residuals builds on Moonshot AI's Kimi Linear architecture, released in October 2025. The team changed only the residual connection mechanism and kept all other training details identical, isolating the impact of the new method.
The approach is not the first attempt to improve residual connections. Prior methods have explored similar ideas with mixed results: some showed no improvement over the baseline, while others improved performance but at significantly higher memory cost. The Kimi Team's ablation study found their method outperformed these alternatives while adding far less overhead.
Moonshot AI, the Beijing-based company behind the Kimi chatbot, has been releasing a steady stream of model and architecture research. The company released Kimi K2 in July 2025 and Kimi K2.5 in January 2026, and is currently seeking to raise up to $1 billion in an expanded funding round that would value the startup at $18 billion.[5]
Why this matters
A 1.25x compute advantage, if it generalizes, is significant. For labs spending millions on model training, that could mean either meaningful cost savings or better models at the same budget.
What makes this particular result notable is the low adoption cost. The method requires no changes to the training data, optimizer, or attention mechanism. It modifies only the residual connections and adds less than 2% inference overhead, which makes it the kind of improvement that could realistically be adopted without redesigning an existing training pipeline.
The key uncertainty is generalization. The paper validates the approach across multiple scales but within a single architecture family. Whether the gains hold across different model designs is an open question.
Written by
Veronica Salvador
Veronica Salvador is an editor at AI News Home, where she covers enterprise AI, emerging models, and the business impact of artificial intelligence. She recently completed UT Austin's Post Graduate Program in Generative AI for Business.
Co-authored by
Joseph Nordqvist
Joseph founded AI News Home in 2026. He studied marketing and later completed a postgraduate program in AI and machine learning (business applications) at UT Austin’s McCombs School of Business. He is now pursuing an MSc in Computer Science at the University of York.
This article was written by the AI News Home editorial team with the assistance of AI-powered research and drafting tools. All analysis, conclusions, and editorial decisions were made by human editors.
References
- 1. Attention Residuals: Technical Report — Kimi Team, Moonshot AI, March 15, 2026. Technical report introducing Attention Residuals (AttnRes), which replaces fixed residual accumulation with learned softmax attention over depth. (Primary)
- 2. Kimi Linear: An Expressive, Efficient Attention Architecture — Kimi Team, Moonshot AI, October 31, 2025
- 3. Deep Residual Learning for Image Recognition — Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, December 10, 2015
- 4. The Unreasonable Ineffectiveness of the Deeper Layers — Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts, March 26, 2025
- 5.