📈 Massive Values Contribute to Contextual Knowledge Understanding
As shown in Table 1, Parametric Knowledge Retrieval tasks maintain relatively high accuracy even when massive values are disrupted, degrading by only 15–20%. For example, Cities tasks retain strong performance (76–88%), and Technology and Celebrity tasks stay above 70%. In contrast, disrupting non-massive values causes less than a 1% performance drop, highlighting the specificity of massive values' role.
For Contextual Knowledge Understanding tasks, however, massive values are crucial to preserving performance.
On reasoning tasks like GSM8K, accuracy drops are dramatic (e.g., Gemma2-9B: 81.3% → 15.1%), and Passkey tasks collapse from 100% to near-zero accuracy (0–2%). IMDB sentiment accuracy also drops from 94% to single digits. This emphasizes the importance of massive values in preserving contextual reasoning ability.
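To make the disruption experiment concrete, here is a minimal sketch of how one might zero out massive values in a Q or K tensor, with an equally sized random set of ordinary entries zeroed as the control. The λ threshold, tensor shapes, and function name are illustrative assumptions, not the paper's exact protocol.

```python
import torch

def disrupt_values(x: torch.Tensor, lam: float = 5.0, mode: str = "massive") -> torch.Tensor:
    """Zero entries of a Q/K tensor whose magnitude exceeds lam * mean |x| ("massive"),
    or an equal number of randomly chosen ordinary entries ("non-massive") as a control."""
    x = x.clone()
    massive_mask = x.abs() > lam * x.abs().mean()
    if mode == "massive":
        x[massive_mask] = 0.0
    else:
        n = int(massive_mask.sum())
        candidates = torch.nonzero(~massive_mask.view(-1)).squeeze(-1)
        chosen = candidates[torch.randperm(candidates.numel())[:n]]
        x.view(-1)[chosen] = 0.0
    return x

# Example: q has shape (batch, num_heads, seq_len, head_dim)
q = torch.randn(1, 8, 128, 64)
q_disrupted = disrupt_values(q, mode="massive")
```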
🧪 PPL and Diversity Score
In addition to accuracy, we assess the models with perplexity (PPL) and diversity as complementary metrics. Lower PPL indicates that the model fits the text more confidently, while higher 2-gram diversity indicates richer, less repetitive output. Both metrics point to the same conclusion: massive values are essential for contextual understanding.
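As a rough illustration of the two metrics, the sketch below computes perplexity as the exponentiated mean negative log-likelihood per token and 2-gram diversity as the distinct-2 ratio (unique bigrams over total bigrams); the exact formulations used in the paper may differ.

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def distinct_2(tokens: list[str]) -> float:
    """2-gram diversity: number of unique bigrams divided by total bigrams."""
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / max(len(bigrams), 1)

print(perplexity([-0.5, -1.2, -0.3]))                    # lower is better
print(distinct_2("the cat sat on the cat mat".split()))  # higher means more diverse output
```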
📉 Massive Values & Quantization
We evaluate three quantization methods — AWQ, SmoothQuant, and GPTQ — to test how well they preserve massive values.
AWQ and SmoothQuant explicitly preserve massive values and maintain strong performance across all tasks. In contrast, GPTQ, which applies no such protection, suffers major accuracy drops on reasoning tasks such as GSM8K and AQuA.
This gap confirms that preserving massive values is key to contextual understanding; without them, models struggle with complex reasoning.
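As an illustration of how massive values can be protected before low-bit quantization, here is a minimal SmoothQuant-style sketch: outlier activation channels are divided by per-channel scales that are folded into the weights, leaving the matrix product unchanged while taming the outliers. The shapes, α value, and helper name are assumptions for illustration, not the evaluated implementations.

```python
import torch

def smooth_scales(act_absmax: torch.Tensor, w_absmax: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    return (act_absmax.clamp(min=1e-5) ** alpha) / (w_absmax.clamp(min=1e-5) ** (1 - alpha))

# X: (tokens, in_features) activations with a few large channels; W: (out_features, in_features) weights.
X = torch.randn(16, 64) * torch.linspace(0.1, 10.0, 64)
W = torch.randn(32, 64)
s = smooth_scales(X.abs().amax(dim=0), W.abs().amax(dim=0))
X_smooth, W_smooth = X / s, W * s  # (X / s) @ (W * s).T == X @ W.T, but X_smooth has milder outliers
assert torch.allclose(X_smooth @ W_smooth.T, X @ W.T, rtol=1e-4, atol=1e-3)
```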
🧩 Why Do Concentrated Massive Values Appear, and Why Only in Q/K?
We find that Rotary Position Embedding (RoPE) is the root cause of concentrated massive values in large language models.
RoPE is applied only to the Query (Q) and Key (K) matrices, not to the Value (V) matrix. This asymmetric design causes extreme activations to appear exclusively in Q and K, while V remains smooth and unstructured.
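To make the asymmetry concrete, here is a minimal rotary-embedding sketch in the common rotate-half form: the rotation is applied to Q and K, while V passes through untouched. The shapes, base frequency, and function name are illustrative assumptions, not a specific model's implementation.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq_len, num_heads, head_dim)."""
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)     # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # (seq_len, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]         # broadcast over heads
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q, k, v = (torch.randn(128, 8, 64) for _ in range(3))
q_rot, k_rot = rope(q), rope(k)  # position information is rotated into Q and K
v_plain = v                      # V receives no rotation, which is the asymmetry discussed above
```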
We verify this across a wide range of LLMs. All RoPE-based models (e.g., LLaMA, Qwen, Gemma, Qwen-VL, LLaVA) show clear massive-value concentrations in Q/K. In contrast, models without RoPE (such as OPT, GPT-2, and Jamba) show no such patterns.
A controlled comparison between GPT-Neo and GPT-NeoX, which differ in their use of RoPE, confirms this: only GPT-NeoX exhibits massive values. The effect persists even with enhanced RoPE variants such as M-RoPE in Qwen2-VL.
These findings confirm that RoPE is both necessary and sufficient to induce concentrated massive values in Q/K, a phenomenon absent in models using other position encoding strategies.