AI coding agents can be manipulated through maliciously prepared GitHub repositories into executing hidden malware, without common security checks flagging the risk.
Anthropic’s Opus 4.6 withstood 6,000 prompt injection attacks in a public security test without being compromised, which points to improved defense mechanisms. Such robustness results, however, do not replace comprehensive security design in production.
Language models respond more strongly to text formatting than to actual content, making them vulnerable to manipulation through cleverly styled inputs that resemble internal system commands.
AI security differs fundamentally from traditional cybersecurity: prompt injection creates a new exploit class for agents, specialized red-teaming models outperform humans at uncovering weaknesses, and larger models are not automatically more robust.
Legitimate AI agents inherently satisfy all three criteria of the “lethal trifecta” (data access, external content, external communication), so security must shift from architectural design to runtime monitoring.
Current AI web agents lack reliable defenses against prompt injection attacks and can carry out an attacker’s objectives undetected, leaving users unaware of the threat.
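
To make the "lethal trifecta" point above concrete, here is a minimal Python sketch of what runtime monitoring could look like: since a useful agent needs data access, external content, and external communication, the monitor does not forbid any of them up front, but it denies outbound communication once a session has both read private data and ingested untrusted content. The class name SessionMonitor, its methods, and the example URL are hypothetical and exist only for illustration; a real agent framework would need far more than this policy.

```python
from dataclasses import dataclass, field


@dataclass
class SessionMonitor:
    """Hypothetical per-session monitor for the three trifecta conditions."""

    touched_private_data: bool = False
    saw_untrusted_content: bool = False
    blocked: list = field(default_factory=list)

    def record_private_read(self) -> None:
        # Call whenever the agent reads data the user would not want leaked.
        self.touched_private_data = True

    def record_untrusted_input(self) -> None:
        # Call whenever the agent ingests content it does not control
        # (web pages, repository files, emails).
        self.saw_untrusted_content = True

    def allow_outbound(self, destination: str) -> bool:
        # Deny external communication once both other conditions hold,
        # because at that point all three parts of the trifecta would meet.
        if self.touched_private_data and self.saw_untrusted_content:
            self.blocked.append(destination)
            return False
        return True


monitor = SessionMonitor()
monitor.record_private_read()
first = monitor.allow_outbound("https://example.com/api")   # one condition so far
monitor.record_untrusted_input()
second = monitor.allow_outbound("https://example.com/api")  # both conditions hold
```

In this sketch the first outbound request is allowed and the second is denied and logged in monitor.blocked. The design choice is deliberately coarse: it trades agent capability for safety at the moment of highest risk, and a production system would likely ask for user confirmation instead of blocking outright.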