Sparse Autoencoders: Interpretable Features Insufficient for Reliable Model Control
18 June 2026
AI Models, Cybersecurity, Regulation
SAE-based safety measures are vulnerable to post-intervention recovery: models can restore suppressed behaviors even when targeted features are controlled.

ICALens: Interpretability Method for Language Models Without Training Additional Autoencoders
11 June 2026
AI Models, Claude AI
ICA-based analysis enables rapid exploration of interpretable directions in language models without expensive training of additional autoencoders.