RL environments with software bugs (stale cache, reward hacks, false state transitions) generate toxic training data that sabotage agent training – systematic quality validation is necessary.
While video generation models produce visually convincing movements, visual quality does not correlate with practical executability by robots — an evaluation criterion overlooked by standard metrics.
Malicious npm packages can overwrite Claude Code’s configuration file, steal OAuth tokens from the network, and use them to access all connected enterprise services while audit logs show clean Anthropic IP addresses.
Hidden-state alignment reduces sampling variance, closes the student-teacher gap more effectively, and trains with less memory and computational time than output-only distillation.
Unvalidated input in Anthropic’s Claude Code GitHub Action enabled complete repository takeover via a simple issue, with potential impact on all dependent downstream projects.
Hugging Face Transformers allows silent remote code execution via obfuscated parameters in model configurations as long as the optional kernels package is installed (CVE-2026-4372, patched in 5.3.0).
CHERRL enables reproducible analysis of reward hacking mechanisms through controlled bias injection and automatic detection of exploitation onset in LLM-based training.
STRIDE formalizes training data attribution as a sparse recovery problem in activation space, achieving an order of magnitude faster results than gradient-based methods.