Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding

1Rutgers University, 2Carnegie Mellon University, 3University of Minnesota, 4New Jersey Institute of Technology
Accepted at ICML 2025
Introduction

In transformer-based Large Language Models that use RoPE (e.g., Llama, Gemma), the attention queries (Q) and keys (K) exhibit concentrated massive values in certain dimensions.



Q and K embedding vectors in Llama-2-7B, Layers 10 and 20. The visualization is two-dimensional because we average over the sequence-length dimension: the horizontal axis is the head index and the vertical axis is the head dimension. The massive values are concentrated at the bottom of each map.
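
A minimal sketch of how such a heatmap can be produced is shown below. It hooks the query projection of one Llama-2-7B layer using the standard Hugging Face `transformers` module layout (`model.model.layers[i].self_attn.q_proj`), averages absolute activations over the sequence dimension, and plots a head-index by head-dim map. This is an illustrative reproduction under those assumptions, not the exact script used for the figure, and it captures Q before the rotary rotation is applied.

```python
# Illustrative sketch (not the paper's script): heatmap of mean |Q| per
# (head, head-dim) position for one layer, averaged over the sequence dimension.
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"        # assumes access to this checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")
model.eval()

layer_idx = 10                                  # Layer 10, as in the figure
captured = {}

def grab_q(module, inputs, output):
    # q_proj output: (batch, seq_len, num_heads * head_dim), before RoPE is applied
    captured["q"] = output.detach()

handle = model.model.layers[layer_idx].self_attn.q_proj.register_forward_hook(grab_q)

text = "Rotary position embeddings concentrate massive values in a few dimensions."
with torch.no_grad():
    model(**tok(text, return_tensors="pt").to(model.device))
handle.remove()

cfg = model.config
head_dim = cfg.hidden_size // cfg.num_attention_heads
q = captured["q"].reshape(1, -1, cfg.num_attention_heads, head_dim)

# Average |Q| over batch and sequence length -> (head_dim, num_heads) heatmap;
# concentrated massive values show up as bright horizontal bands.
heat = q.abs().float().mean(dim=(0, 1)).T.cpu()

plt.imshow(heat.numpy(), aspect="auto")
plt.xlabel("head index")
plt.ylabel("head dim")
plt.colorbar(label="mean |Q|")
plt.title(f"Llama-2-7B, layer {layer_idx}")
plt.savefig("q_massive_values.png")
```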

Abstract

Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show for the first time that concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K), while no such patterns appear in values (V), across various modern transformer-based LLMs. Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (i.e., knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model’s parameters. Our investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis. Finally, we trace the emergence of concentrated massive values and find that the concentration is caused by Rotary Positional Encoding (RoPE) and appears from the very first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization.



🔍 Key Findings

We highlight three core findings from our study of massive values in LLM attention mechanisms.

📊 Functional Role of Massive Values

Extensive experiments show that massive values in the Q and K matrices play a crucial role in contextual knowledge understanding, while having only a limited effect on parametric knowledge retrieval.

⚙️ Impact on Quantization

We evaluate three quantization strategies (AWQ, SmoothQuant, and GPTQ) and find that methods that explicitly address massive values better preserve contextual understanding. This suggests the need for quantization-aware designs.

⏱️ Temporal Origin Analysis

Through causal and temporal analysis, we show that massive values originate from the RoPE mechanism and appear as early as the initial layers.

📊 Experiments

📈 Massive Values Contribute to Contextual Knowledge Understanding

As shown in Table 1, parametric knowledge retrieval tasks maintain relatively high accuracy even when massive values are disrupted, degrading by only 15–20%: Cities tasks retain strong performance (76–88%), and Technology and Celebrity tasks remain above 70%. In contrast, disrupting non-massive values causes less than a 1% performance drop, highlighting the specificity of the massive values' role.

For contextual knowledge understanding tasks, however, massive values are crucial for preserving performance.

On reasoning tasks like GSM8K, accuracy drops are dramatic (e.g., Gemma2-9B: 81.3% → 15.1%), and Passkey tasks collapse from 100% to near-zero accuracy (0–2%). IMDB sentiment accuracy also drops from 94% to single digits. This emphasizes the importance of massive values in preserving contextual reasoning ability.
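
To give a concrete picture of this kind of intervention, the sketch below ranks query-state dimensions by mean absolute activation, overwrites either the top-k (massive) dimensions or an equal number of non-massive ones at inference time via a forward hook, and leaves the benchmark evaluation to the caller. The hook target (`q_proj`), the value of k, and the overwrite rule (replacing with the mean) are our simplifications, not the paper's exact protocol.

```python
# Sketch of a massive-value disruption experiment (a simplification, not the
# paper's exact protocol): overwrite either the largest-magnitude query
# dimensions or an equal number of ordinary ones, then re-run a benchmark.
import torch

def make_disruption_hook(top_k=8, disrupt_massive=True):
    def hook(module, inputs, output):
        # output: (batch, seq_len, hidden_size) query states from q_proj
        magnitude = output.abs().mean(dim=(0, 1))            # mean |Q| per dimension
        massive_idx = magnitude.topk(top_k).indices
        if disrupt_massive:
            target_idx = massive_idx
        else:
            # Control condition: pick the same number of non-massive dimensions
            mask = torch.ones_like(magnitude, dtype=torch.bool)
            mask[massive_idx] = False
            target_idx = torch.nonzero(mask).squeeze(-1)[:top_k]
        output = output.clone()
        output[..., target_idx] = output.mean()              # flatten the chosen dims
        return output                                        # returned value replaces q_proj output
    return hook

def attach_disruption(model, disrupt_massive=True):
    """Register the hook on every layer's q_proj (Llama-style module layout)."""
    return [
        layer.self_attn.q_proj.register_forward_hook(
            make_disruption_hook(disrupt_massive=disrupt_massive))
        for layer in model.model.layers
    ]

# Usage: handles = attach_disruption(model, disrupt_massive=True); evaluate
# GSM8K / Passkey / IMDB; then [h.remove() for h in handles] and repeat with
# disrupt_massive=False for the non-massive control.
```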

🧪 PPL and Diversity Score

In addition to accuracy, we assess the models with perplexity (PPL) and diversity as complementary metrics. Lower PPL suggests better modeling confidence, while higher 2-gram diversity indicates richer output. Both metrics support the same conclusion: massive values are essential for contextual understanding.
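
Both metrics are standard; a minimal sketch of how they can be computed is shown below: perplexity as the exponential of the average per-token negative log-likelihood, and 2-gram diversity as the fraction of distinct bigrams among all bigrams in a generated text. The helper names are ours, not from the paper's code.

```python
# Minimal implementations of the two auxiliary metrics (our own helpers).
import math
import torch

def perplexity(model, tokenizer, text):
    """PPL = exp(mean negative log-likelihood per token)."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])   # HF causal LMs return mean NLL as .loss
    return math.exp(out.loss.item())

def distinct_2(text):
    """2-gram diversity: unique bigrams / total bigrams over whitespace tokens."""
    tokens = text.split()
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0
```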




📉 Massive Values & Quantization

We evaluate three quantization methods — AWQ, SmoothQuant, and GPTQ — to test how well they preserve massive values.

AWQ and SmoothQuant explicitly preserve massive values and maintain strong performance across all tasks. In contrast, GPTQ, which doesn’t, suffers major accuracy drops on reasoning tasks like GSM8K and AQUA.

This gap confirms that preserving massive values is key to contextual understanding; without that preservation, models struggle with complex reasoning.


Impact of different quantization methods on Llama3-8B across different benchmarks.
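
The toy example below illustrates the underlying failure mode rather than reimplementing AWQ, SmoothQuant, or GPTQ: naively quantizing a matrix that contains a few massive-value channels to 4 bits destroys the resolution of all the ordinary channels, whereas keeping those few channels in full precision (a crude stand-in for outlier-aware schemes) preserves it. The channel counts and the protection rule are our own simplifications.

```python
# Toy illustration: naive 4-bit quantization of an activation-like matrix with
# a few "massive value" channels, with and without protecting those channels.
import torch

torch.manual_seed(0)
x = torch.randn(1024, 128)
x[:, :4] *= 50.0                      # inject concentrated massive values in 4 channels

def quantize_int4(t):
    """Symmetric per-tensor 4-bit quantization (scale fit to the max magnitude)."""
    scale = t.abs().max() / 7.0       # int4 range: [-8, 7]
    return (t / scale).round().clamp(-8, 7) * scale

# (a) Quantize everything: the massive channels blow up the scale,
#     so the ordinary channels lose almost all resolution.
err_naive = (x - quantize_int4(x)).pow(2).mean()

# (b) Keep the massive channels in full precision and quantize the rest
#     with a scale fit to them.
x_q = x.clone()
x_q[:, 4:] = quantize_int4(x[:, 4:])
err_protected = (x_q - x).pow(2).mean()

print(f"naive int4 MSE:     {err_naive:.4f}")
print(f"protected int4 MSE: {err_protected:.4f}")
```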



🧩 Why Do Concentrated Massive Values Appear, and Why Only in Q/K?

We find that Rotary Position Embedding (RoPE) is the root cause of the emergence of concentrated massive values in large language models.

RoPE is selectively applied only to the Query (Q) and Key (K) matrices, but not to the Value (V). This asymmetric design causes extreme activations to appear exclusively in Q and K, while V remains smooth and unstructured.
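The asymmetry is visible directly in a standard RoPE implementation: the position-dependent rotation is applied to Q and K just before the attention scores are computed, and V never passes through it. The sketch below uses the Llama-style "rotate-half" formulation and is an illustration of the mechanism, not the paper's analysis code.

```python
# Minimal rotary position embedding (RoPE): rotate pairs of dimensions in
# Q and K by a position-dependent angle; V is left unchanged.
import torch

def rope(x, base=10000.0):
    """x: (seq_len, num_heads, head_dim) with even head_dim."""
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # Per-pair rotation frequencies: low indices rotate fast, high indices
    # rotate slowly (the low-frequency channels tied to massive values).
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # (seq, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]          # Llama-style "rotate_half" split
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def attention(q, k, v):
    """q, k, v: (seq_len, num_heads, head_dim)."""
    q, k = rope(q), rope(k)                        # RoPE applied only to Q and K
    scores = torch.einsum("qhd,khd->hqk", q, k) / q.shape[-1] ** 0.5
    return torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v)  # V untouched by RoPE

# Toy usage
q = k = v = torch.randn(16, 8, 64)
out = attention(q, k, v)                            # (16, 8, 64)
```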

We verify this across a wide range of LLMs. All RoPE-based models (e.g., LLaMA, Qwen, Gemma, Qwen-VL, LLaVA) show clear massive value concentrations in Q/K. In contrast, models without RoPE (like OPT, GPT2, and Jamba) show no such patterns.

A controlled comparison between GPT-Neo and GPT-NeoX (which differ in their use of RoPE) confirms this: only GPT-NeoX exhibits massive values. The effect persists even with enhanced RoPE variants such as M-RoPE in Qwen2-VL.

These findings confirm that RoPE is both necessary and sufficient to induce concentrated massive values in Q/K, a phenomenon absent in models using other position encoding strategies.

Conclusion

Our study provides novel insights into the role and origin of massive values in Large Language Models (LLMs). Through systematic investigation, we find that massive values are critical in contextual knowledge understanding tasks, such as passkey retrieval and IMDB sentiment understanding, whereas their influence on parametric knowledge retrieval tasks, such as world knowledge retrieval, is limited. This finding emphasizes the importance of preserving massive values to maintain model performance on reasoning and context-dependent tasks. Our investigation further reveals that RoPE induces massive-value stripes, distinct patterns that appear exclusively in Q and K and are absent in models without RoPE, such as OPT. This highlights how positional encoding mechanisms contribute to massive values, particularly in low-frequency channel dimensions, offering new insights into RoPE's role in LLMs. Overall, this study establishes a deeper understanding of massive values in LLMs: their critical role in contextual knowledge understanding, their implications for model optimization techniques such as quantization, and their connection to RoPE-induced patterns. These findings lay the foundation for developing more robust, efficient, and interpretable LLM architectures and optimization strategies.

BibTeX

@article{jin2025massive,
  title={Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding},
  author={Jin, Mingyu and Mei, Kai and Xu, Wujiang and Sun, Mingjie and Tang, Ruixiang and Du, Mengnan and Liu, Zirui and Zhang, Yongfeng},
  journal={arXiv preprint arXiv:2502.01563},
  year={2025}
}