In brief: Jailbreak attempts leave measurable entropy signatures in the intermediate layers of LLMs, and the way entropy evolves across token positions is a more reliable indicator than static averages.
Researchers have developed a method to detect jailbreak attacks on large language models by analyzing prediction entropy in intermediate network layers. The signal is not found at input or output, but reveals itself through structured uncertainty patterns in the model’s internal representations.
Jailbreak attacks circumvent LLM safety training through carefully formulated prompts that elicit policy-violating responses. Previous defense mechanisms focus on the input or output layer. This research investigates where and how malicious intent is encoded in the model’s inner representations.
The analysis computes token-level prediction entropy at each network layer using the logit lens method, which decodes intermediate hidden states through the model’s output head. The key finding: static aggregate statistics of prompt-level entropy (mean, variance) have low discriminative power. By contrast, features that capture how entropy evolves across token positions, such as monotonic rank-based trend scores, have significantly greater predictive value. The layer also matters: the signal concentrates in intermediate network layers and degrades at the final layer. This suggests that jailbreak-relevant structure is localized in intermediate representations rather than at the output head.
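To make the two ingredients concrete, here is a minimal sketch of per-token prediction entropy and a rank-based trend score over token positions. It is an illustration under assumptions, not the researchers' implementation: the function names `softmax_entropy` and `rank_trend_score` are hypothetical, the trend score is a Kendall-style statistic chosen as one plausible example of a "monotonic rank-based trend score", and the logits are toy values standing in for what a logit lens would produce at a single layer.

```python
import math

def softmax_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one logit vector."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def rank_trend_score(values):
    """Kendall-style trend in [-1, 1]: (rising pairs - falling pairs) / all pairs.

    Assumption: this stands in for the rank-based trend feature; the exact
    statistic used in the research is not specified in this summary.
    """
    n = len(values)
    if n < 2:
        return 0.0
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = values[j] - values[i]
            s += (d > 0) - (d < 0)             # +1 rising, -1 falling, 0 tie
    return s / (n * (n - 1) / 2)

# Toy logits for four token positions at one layer (hypothetical values).
toy_logits = [
    [4.0, 0.0, 0.0, 0.0],   # confident prediction, low entropy
    [2.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],   # uniform prediction, maximum entropy log(4)
]
entropies = [softmax_entropy(row) for row in toy_logits]
trend = rank_trend_score(entropies)            # entropy rises at every step here
print(entropies, trend)
```

Because the score depends only on the ordering of entropies across positions, it is insensitive to their absolute scale. The mean and variance discard exactly that ordering information.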
The method was tested across multiple architectures (Llama, Qwen, Gemma) and various adversarial benchmarks without additional training. The entropy dynamics provide consistent separation between legitimate and jailbreak prompts. Together, the results clarify both which entropy-derived features encode malicious intent and where in the network this signal is strongest.
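One common, training-free way to quantify how well a single feature separates two prompt classes is the area under the ROC curve (AUROC). The sketch below is an assumption about how such separation could be measured, not a description of the evaluation actually used. The helper `auroc` is hypothetical, the scores are made-up numbers rather than results, and which class scores higher is also an assumption.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical per-prompt trend scores, for illustration only.
jailbreak_scores = [0.8, 0.6, 0.4]
benign_scores = [0.5, 0.1, -0.2]
separation = auroc(jailbreak_scores, benign_scores)
print(separation)
```

A value of 0.5 means the feature carries no information about the class, while values near 0 or 1 indicate strong separation in one direction or the other.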
Source: arxiv.org · Published June 22, 2026
Lumi AI News — AI-assisted curation pursuant to Art. 50 EU AI Act. Paraphrase and classification through Lumi News Pipeline v1.7.1.