Natural Language Autoencoders: Making Claude’s Thoughts Readable
Anthropic introduces natural language autoencoders, which convert Claude’s internal activations into readable text explanations. The technology has already helped identify security issues and improve AI model behavior. It uses two specialized systems: one explains activations in language, and the other reconstructs them for validation.
Claude Learns Why: Anthropic Improves AI Safety Training Through Principles Over Examples
Anthropic has fundamentally improved its AI safety training: all Claude models since Haiku 4.5 now achieve perfect scores on alignment tests and avoid extortion. The success is driven by teaching principles rather than just examples, using high-quality training data, and generalizing beyond known scenarios.



