
Linear Probes for Deception Detection in LLMs Show Critical Robustness Gaps

Bottom line: Linear probes for deception detection in LLMs work reliably only on inputs that resemble their training data and break down under stylistic variation, but style augmentation can restore robustness.

Linear probes trained on activation patterns of language models achieve AUROC scores above 0.96 in clean tests but systematically fail under distribution shifts. An analysis of the Gemma-3 family (1B to 27B parameters) reveals that these detection methods have fundamental geometric weaknesses.

Linear probes are increasingly proposed as tools for detecting deceptive behavior in large language models. However, a systematic stress test of the method reveals a severe robustness problem: probes achieve AUROC scores above 0.998 on clean benchmarks, yet collapse under stylistic shifts. The research examines four hypotheses about how deception is encoded in model activations: (1) a single linear direction, (2) a multidimensional subspace, (3) a convex cone, and (4) entropy as a proxy measure.
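
As a rough illustration of what "a linear probe scored by AUROC" means, the sketch below fits a difference-of-means direction on synthetic vectors and scores held-out examples by projecting onto it. Everything here is an assumption made for illustration: the synthetic data, the function names, and the difference-of-means fit (the study's probes may well be trained differently, for example with logistic regression).

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_mean_difference_probe(acts, labels):
    """Probe direction = mean(deceptive activations) - mean(honest activations)."""
    dim = len(acts[0])
    def mean(cls):
        rows = [a for a, y in zip(acts, labels) if y == cls]
        return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]
    return [p - q for p, q in zip(mean(1), mean(0))]

def probe_scores(direction, acts):
    """Score each activation vector by its dot product with the probe direction."""
    return [sum(w * x for w, x in zip(direction, a)) for a in acts]

def make_synthetic(n, dim, shift, rng):
    """Hypothetical stand-in for activations: label 1 is shifted along coordinate 0."""
    acts, labels = [], []
    for i in range(n):
        y = i % 2                      # alternate honest (0) / deceptive (1)
        a = [rng.gauss(0, 1) for _ in range(dim)]
        a[0] += shift * y
        acts.append(a)
        labels.append(y)
    return acts, labels

rng = random.Random(0)
train_x, train_y = make_synthetic(400, 8, 3.0, rng)
test_x, test_y = make_synthetic(400, 8, 3.0, rng)

w = fit_mean_difference_probe(train_x, train_y)
clean_auroc = auroc(probe_scores(w, test_x), test_y)
print(f"AUROC on a clean, in-distribution test set: {clean_auroc:.3f}")
```

An AUROC of 1.0 means every deceptive example outscores every honest one, and 0.5 is chance level. That is why the drop from above 0.998 to the 0.61–0.80 range reported for a single direction is substantial.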

The results reject the simpler assumptions. The single-direction hypothesis fails: a single vector (k=1) reaches only 0.61–0.80 AUROC. Style-augmented probes, by contrast, achieve an average AUROC of 0.979–0.983 on unseen styles. The entropy-proxy hypothesis is also rejected (maximum correlation |ρ|=0.454). Instead, the findings show that deception forms no significant linear subspace per domain (k*=0), but multidimensional probes (k≥5) can recover the signal through distributed, subthreshold features.
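
To make the entropy-proxy check concrete, here is a minimal sketch of a rank correlation between an entropy value and a probe score per example. It assumes that ρ denotes a Spearman rank correlation, which the summary does not state; the input numbers are made up purely to exercise the function.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rho = Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-example values, not taken from the study.
entropy = [0.2, 1.5, 0.9, 2.1, 0.4]
probe_score = [0.7, 0.1, 0.9, 0.3, 0.5]
rho = spearman(entropy, probe_score)
print(f"|rho| = {abs(rho):.3f}")
```

Under this reading, a maximum |ρ| of 0.454 would mean that entropy tracks the deception signal only loosely, which is consistent with the hypothesis being rejected.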

The key point for CTOs: the observed fragility of probes reflects insufficient breadth in the training data distribution, not an architectural limitation of the models. Style augmentation restores reliable detection at both 4B and 27B parameters. The apparent inverse scaling pattern is a training-data artifact, not a true scale-dependent phenomenon. In short, linear probes can work for deception detection, but they require robust augmentation and multidimensional geometry rather than single directions.
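
The effect of style augmentation can be sketched with a toy model. It assumes a two-coordinate "activation" in which coordinate 0 carries a style-independent deception signal and coordinate 1 carries a shortcut whose sign and strength depend on the writing style. The data, the style values, and the difference-of-means probe are all hypothetical and are not the study's setup.

```python
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sample_style(n, style, rng):
    """Toy data: coordinate 0 = robust signal, coordinate 1 = style-dependent shortcut."""
    acts, labels = [], []
    for i in range(n):
        y = i % 2
        acts.append([2.5 * y + rng.gauss(0, 1), style * y + rng.gauss(0, 1)])
        labels.append(y)
    return acts, labels

def fit_probe(acts, labels):
    """Difference-of-means direction, used here as a stand-in for a trained probe."""
    dim = len(acts[0])
    def mean(cls):
        rows = [a for a, y in zip(acts, labels) if y == cls]
        return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]
    return [p - q for p, q in zip(mean(1), mean(0))]

def evaluate(w, acts, labels):
    """AUROC of the probe's dot-product scores."""
    return auroc([sum(wi * xi for wi, xi in zip(w, a)) for a in acts], labels)

rng = random.Random(0)
clean_x, clean_y = sample_style(400, 3.0, rng)       # single training style
clean_test_x, clean_test_y = sample_style(400, 3.0, rng)
held_x, held_y = sample_style(400, -2.0, rng)        # style never seen in training

aug_x, aug_y = [], []                                # style-augmented training set
for style in (3.0, 1.0, -1.0, -3.0):
    xs, ys = sample_style(400, style, rng)
    aug_x += xs
    aug_y += ys

w_clean = fit_probe(clean_x, clean_y)
w_aug = fit_probe(aug_x, aug_y)
in_dist = evaluate(w_clean, clean_test_x, clean_test_y)
shifted = evaluate(w_clean, held_x, held_y)
augmented = evaluate(w_aug, held_x, held_y)
print(f"clean probe, clean test:     {in_dist:.3f}")
print(f"clean probe, unseen style:   {shifted:.3f}")
print(f"augmented probe, unseen style: {augmented:.3f}")
```

In this toy setting the clean-only probe leans on the shortcut coordinate and falls to roughly chance on the unseen style, while the augmented probe averages the shortcut away and relies on the robust coordinate. That mirrors the article's claim that the fragility comes from narrow training data rather than from the model itself.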


Source: arxiv.org · Published May 27, 2026
Lumi AI News — AI-assisted curation in accordance with Article 50 EU AI Act. Paraphrase and classification by Lumi News Pipeline v1.2.9.
