The gist: Language models respond more strongly to text formatting than to actual content, making them vulnerable to manipulation through cleverly styled inputs that resemble internal system commands.

Researchers have demonstrated that language models like GPT-4 and similar systems cannot distinguish between their own system instructions and user inputs. Instead, they primarily orient themselves based on text formatting.

Researchers including Charles Ye, Jasmine Cui and Dylan Hadfield-Menell investigated how language models differentiate their own privileged system texts (in tags like <system>, <think> and <assistant>) from untrusted user inputs. The key finding: models cannot reliably recognize these boundaries and instead orient themselves based on text formatting and writing style of the inputs.

The researchers demonstrated this through a simple experiment: when they first provided a model with a harmful request and then appended text formatted in the same style as internal system thoughts, models like GPT-OSS-20B were able to override their original security guidelines. A concrete example showed that a model began to release previously prohibited content after the addition of appropriately formatted “policy” clauses.

Particularly striking was the effect of “destyling”: when researchers introduced the same harmful information with slightly different formatting that humans barely noticed, the success rate of jailbreak attacks dropped from an average of 61% to 10%. The change was barely perceptible to human readers but was decisive for models.

The research team calls this mechanism “role confusion” and identifies it as a core problem in modern prompt injection attacks. As long as language models do not master true role perception, injection protection measures will remain subject to a permanent arms race. Particularly problematic: continuous boundaries between roles create attack surface for subtle, mass-scale model manipulation through apparently harmless texts.

Source: simonwillison.net · Published June 23, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrasing and classification by Lumi News Pipeline v1.7.1.

Share on:

Language Models Confuse System Instructions with User Input

Lumi AI News

Legal

Topics