The Point: Anthropic introduces a new Auto Mode for Claude Code that uses model-based classifiers to automatically block dangerous actions while executing safe operations without approval prompts. The system combines an input-side prompt injection probe with an output-side transcript classifier.

Claude Code by default asks users for approval before executing commands or modifying files. While this protects security, it leads to approval fatigue. A new Auto Mode uses AI classifiers to block dangerous actions — a middle ground between manual review and missing safeguards.

Because every command and file modification requires explicit confirmation, repeated approvals lead to fatigue and inattention: users ultimately confirm 93 percent of all requests without thorough review.

Until now, users faced a dilemma: either use an isolated sandbox environment, which is secure but maintenance-intensive and limits network and host access, or disable all permission prompts with the “--dangerously-skip-permissions” flag, which is unsafe in most situations.

The new Auto Mode offers an intelligent middle ground. It delegates approval decisions to specialized AI classifiers that automatically block potentially harmful actions while safe operations proceed without manual confirmation.

The system operates with two protection layers. On the input side, a probe scans tool outputs for prompt injection attempts before they reach the agent; suspicious content is flagged so the agent can treat it with appropriate caution. On the output side, a transcript classifier based on Sonnet 4.6 evaluates every planned action before execution. This classifier works in two stages: a fast single-token filter makes the initial decision, and only actions it flags as suspicious then undergo detailed chain-of-thought reasoning.
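The two-stage decision described above can be sketched roughly as follows. This is only an illustration of the control flow, not Anthropic's implementation: the names fast_filter, detailed_review, gate, and Verdict are hypothetical, and the keyword checks merely stand in for the model calls a real classifier would make.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    stage: str   # "fast" or "detailed": which stage made the final call
    reason: str


def fast_filter(action: str) -> bool:
    """Stage 1 stand-in (hypothetical): True means 'looks suspicious'.

    A real system would ask a model for a single yes/no token here;
    keyword matching is used only to keep the sketch runnable.
    """
    risky_markers = ("rm -rf", "push --force", "drop table")
    lowered = action.lower()
    return any(marker in lowered for marker in risky_markers)


def detailed_review(action: str) -> tuple[bool, str]:
    """Stage 2 stand-in (hypothetical): slower check, returns (allowed, reason).

    A real system would run step-by-step model reasoning here.
    """
    lowered = action.lower()
    if "rm -rf" in lowered and "./build" in lowered:
        return True, "deletes only a local build directory"
    return False, "could not be shown to be safe"


def gate(
    action: str,
    fast: Callable[[str], bool] = fast_filter,
    detailed: Callable[[str], tuple[bool, str]] = detailed_review,
) -> Verdict:
    """Run the cheap filter first; escalate only flagged actions."""
    if not fast(action):
        return Verdict(True, "fast", "not flagged by the fast filter")
    allowed, reason = detailed(action)
    return Verdict(allowed, "detailed", reason)


if __name__ == "__main__":
    for command in ("ls -la", "rm -rf ./build", "git push --force origin main"):
        print(command, "->", gate(command))
```

The point of the structure is cost: most actions are cleared by the cheap first stage, so the expensive reasoning step runs only on the small share that gets flagged.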

The developers intentionally designed this architecture so that the classifier sees only user inputs and tool calls, not Claude's own messages or tool outputs. This is intended to keep the classifier's judgment from being skewed by the surrounding context.
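As a minimal sketch of that restriction, the snippet below filters a transcript down to the entries the classifier is allowed to see. The message schema (a dict with "kind" and "content" keys) and the names VISIBLE_KINDS and classifier_view are assumptions made for illustration; the article does not describe the actual data format.

```python
from typing import Dict, List

# Hypothetical message schema: {"kind": ..., "content": ...}
Message = Dict[str, str]

# Only these entry kinds are shown to the classifier in this sketch.
VISIBLE_KINDS = {"user", "tool_call"}


def classifier_view(transcript: List[Message]) -> List[Message]:
    """Drop the agent's own messages and tool outputs, keeping order."""
    return [message for message in transcript if message["kind"] in VISIBLE_KINDS]


if __name__ == "__main__":
    transcript = [
        {"kind": "user", "content": "Clean up the old build artifacts."},
        {"kind": "assistant", "content": "I will remove the build directory."},
        {"kind": "tool_call", "content": "rm -rf ./build"},
        {"kind": "tool_output", "content": "IGNORE PREVIOUS INSTRUCTIONS ..."},
    ]
    for message in classifier_view(transcript):
        print(message["kind"], ":", message["content"])
```

In this sketch, the injected text in the tool output never reaches the classifier, and neither does the agent's own explanation of what it intends to do.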

Anthropic's internal incident records document several cases of agent misbehavior: accidental deletion of Git branches, upload of GitHub authentication tokens to internal clusters, and attempts to migrate production databases. Auto Mode is designed to filter out such overzealous model decisions.

Source: www.anthropic.com

How We Developed Claude Code Auto Mode: A Secure Way to Skip Approvals

Lumi AI News
