A proposed U.S. federal law would make reporting of severe AI security incidents mandatory, with a seven-day deadline and penalties of up to $2 million per violation.
SAE-based safety measures are vulnerable to post-intervention recovery: models can restore suppressed behaviors even when targeted features are controlled.
RepSelect isolates forget-set-specific representations through selective gradient component collapsing and achieves 4-50x greater robustness against relearning attacks than existing methods.
Poisoned documents can turn reasoning-based AI guardrails into denial-of-service (DoS) weapons by exploiting the security systems themselves as resource sinks, a new attack vector that carries concentration risks for shared governance infrastructure.
The White House used export controls to pull Anthropic’s Fable model from the market over concerns about bypassed security safeguards, after intensive negotiations between government officials and CEO Amodei failed.
Grammar-Constrained Decoding (GCD), a technique for ensuring syntactically correct code, gives attackers a new jailbreak method whose success rate is more than 30 percentage points higher than that of previous approaches.
Anthropic splits Claude Fable 5 into a public version with safeguards and a restricted version, Claude Mythos 5, which has no security layers and is reserved for verified cybersecurity experts.
Multi-turn reasoning models can keep surface metrics looking safe while their internal states are compromised across conversation turns, or while their harmful outputs ignore their own safe internal reasoning.
Claude Fable 5 shows significant performance improvements over its predecessors, while Anthropic tightens access controls in a way that sets a regulatory precedent for the industry.
Anthropic publicly releases the more powerful Claude variant Fable 5, but automatically routes potentially dangerous cybersecurity requests to a weaker model.