
Transformer Dimensions as Direct Semantic Registers: Training-Free Interpretability via Sign Patterns

In a nutshell: The signs of individual dimensions in transformers carry semantic information and enable feature detection without training or rotation, opening a new path to mechanistic interpretability.

Researchers show that the standard basis of transformer hidden states already provides a training-free feature representation: individual dimensions encode semantic content through their sign (plus/minus) and confidence through their magnitude. This mechanism works across language, vision, and audio models.

The “Bag of Dims” framework treats hidden-state dimensions as independent binary registers. A feature is defined as a subset of dimensions with consistent sign patterns, and detection works by counting sign matches, with no learned rotation. Validation spans seven models: Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, and Qwen3-32B (language); DINOv2 and ViT-Base (vision); and AST (audio).
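The detection rule described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the `min_agreement` threshold, and the toy four-dimensional hidden states are all assumptions made for illustration, and treating zero as positive is an arbitrary choice.

```python
from typing import Dict, Sequence


def sign(x: float) -> int:
    """Return +1 for non-negative values and -1 for negative values."""
    return 1 if x >= 0 else -1


def build_feature(examples: Sequence[Sequence[float]],
                  min_agreement: float = 1.0) -> Dict[int, int]:
    """Keep the dimensions whose sign is consistent across the examples.

    Returns {dimension index: expected sign}. `min_agreement` is a
    hypothetical threshold: the fraction of examples that must share a sign.
    """
    n = len(examples)
    feature: Dict[int, int] = {}
    for d in range(len(examples[0])):
        frac_pos = sum(1 for h in examples if sign(h[d]) == 1) / n
        if frac_pos >= min_agreement:
            feature[d] = 1
        elif 1 - frac_pos >= min_agreement:
            feature[d] = -1
    return feature


def match_score(hidden: Sequence[float], feature: Dict[int, int]) -> float:
    """Fraction of the feature's dimensions whose sign matches `hidden`."""
    if not feature:
        return 0.0
    hits = sum(1 for d, s in feature.items() if sign(hidden[d]) == s)
    return hits / len(feature)


# Toy hidden states for one category (made-up numbers, 4 dimensions).
animal_states = [
    [0.9, -1.2, 0.3, -0.1],
    [1.4, -0.4, -0.7, -2.0],
    [0.2, -0.8, 0.5, -0.6],
]
animal_feature = build_feature(animal_states)  # dimension 2 is mixed, so it is dropped
print(animal_feature)
print(match_score([0.5, -0.3, 9.0, -0.2], animal_feature))
```

No weights are fitted here: building the feature and scoring a new state both reduce to sign comparisons and counting.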

Signs alone carry predictive content. Sign patterns with every magnitude set to 1 retain 60–93% of top-5 next-token accuracy when passed through the LM head, and decoder-free Hamming scoring reaches 80–90% top-4096 accuracy. From a single-token cache (one forward pass per token, no context, no labels), sign matching detects 175 categories with AUC 0.97–0.99. A trained probe adds only 0.018 AUC and converges to axis-aligned weights.
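One way to picture decoder-free Hamming scoring is sketched below. The summary does not specify how the per-token sign codes are obtained, so the `token_codes` table, the token names, and the tie-breaking rule are assumptions; the sketch only shows the general idea of ranking candidates by Hamming distance between sign patterns.

```python
from typing import Dict, List, Sequence


def to_signs(vec: Sequence[float]) -> List[int]:
    """Discard magnitudes: map each entry to +1 (non-negative) or -1."""
    return [1 if x >= 0 else -1 for x in vec]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two sign patterns disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def top_k_by_hamming(hidden: Sequence[float],
                     token_codes: Dict[str, List[int]],
                     k: int) -> List[str]:
    """Rank candidate tokens by Hamming distance to the hidden state's signs.

    Ties are broken alphabetically so the result is deterministic.
    """
    h = to_signs(hidden)
    ranked = sorted(token_codes, key=lambda t: (hamming(h, token_codes[t]), t))
    return ranked[:k]


# Hypothetical sign codes for three candidate tokens.
token_codes = {
    "cat": [1, -1, 1, -1],
    "dog": [1, -1, -1, -1],
    "car": [-1, 1, 1, 1],
}
print(top_k_by_hamming([0.3, -2.1, 0.05, -0.4], token_codes, k=2))
```

For two ±1 vectors of length D, the dot product equals D minus twice the Hamming distance, so this ranking is the same as ranking by dot product between sign patterns.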

The detected features are causally effective. They survive the K/V attention projections and can be traced back to coalitions of FFN neurons, which random-weight controls never reproduce. Flipping the sign of a feature during a live forward pass suppresses its concept in four language models, and the effect is specific to both the magnitude and the concept. Dimensions remain independent (mutual information below 0.006 bits).
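The sign-flip intervention can be sketched on a single hidden-state vector. This is an illustration under assumptions, not the procedure from the paper: `flip_feature`, the `scale` parameter, and the toy values are hypothetical, and in a real model the same operation would have to be applied to the hidden state inside the forward pass.

```python
from typing import Dict, List, Sequence


def flip_feature(hidden: Sequence[float],
                 feature: Dict[int, int],
                 scale: float = 1.0) -> List[float]:
    """Return a copy of `hidden` with every feature dimension pushed to the
    opposite of its expected sign; all other dimensions are left untouched.

    `scale` (hypothetical) multiplies the magnitude of the flipped entries.
    """
    out = list(hidden)
    for d, expected_sign in feature.items():
        out[d] = -expected_sign * abs(out[d]) * scale
    return out


# Hypothetical feature: dimension index -> expected sign.
feature = {0: 1, 1: -1, 3: -1}
hidden = [0.5, -0.3, 9.0, -0.2]
flipped = flip_feature(hidden, feature)
print(flipped)  # dimension 2 is not part of the feature and keeps its value
```

Because only the listed dimensions are modified, comparing model output before and after the flip isolates the contribution of that one sign pattern.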

The structure is not specific to language models: the same per-dimension sign patterns appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories). This suggests the structure emerges from transformer training itself, not from the language-modeling objective. Feature readout works in the standard basis in a single forward pass, with no optimization or GPU-intensive computation. The open problem shifts from finding the right rotation to systematically cataloging what each dimension encodes.


Source: arxiv.org · Published 16 June 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrasing and classification by Lumi News Pipeline v1.7.1.
