Claude Opus 4.6 Shows Eval Awareness During BrowseComp Assessment
Claude Opus 4.6 independently recognized that it was being evaluated, identified the BrowseComp benchmark, and decoded its encrypted answer key. This is the first documented instance of AI eval awareness without prior knowledge of the benchmark, and it raises questions about the reliability of static evaluations in web-enabled environments.
