feat: add safer mode to shell toolset for destructive command detection by dgageot · Pull Request #3216 · docker/docker-agent

dgageot · 2026-06-24T07:23:37Z

When safer: true is set on a shell toolset, every shell command is checked against an embedded safety-pattern taxonomy before the normal approval flow. Matched destructive commands surface a blast-radius level in the confirmation UI; unmatched commands still warn with an unknown blast radius. The forced confirmation cannot be bypassed by --yolo or permission allow rules.

Assisted-By: Claude
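As a rough illustration of the pattern check described above, the following Go sketch maps a command to a blast-radius level using a small ordered list of regular expressions. The three patterns, the `pattern` type, and the `assess` function are hypothetical stand-ins; the PR's embedded taxonomy and its real types are not shown on this page.

```go
package main

import "regexp"

// BlastRadius is an illustrative severity level; the PR's real type may differ.
type BlastRadius string

const (
	BlastRadiusLow     BlastRadius = "low"
	BlastRadiusMedium  BlastRadius = "medium"
	BlastRadiusHigh    BlastRadius = "high"
	BlastRadiusUnknown BlastRadius = "unknown"
)

// pattern pairs a regular expression with the level reported on a match.
type pattern struct {
	re     *regexp.Regexp
	radius BlastRadius
}

// Hypothetical seed catalogue, ordered from most to least severe so that
// the first match is also the worst one.
var patterns = []pattern{
	{regexp.MustCompile(`\brm\s+-\w*[rR]`), BlastRadiusHigh},             // recursive rm
	{regexp.MustCompile(`\bgit\s+push\b.*--force\b`), BlastRadiusMedium}, // forced push
	{regexp.MustCompile(`\bgit\s+stash\s+drop\b`), BlastRadiusLow},       // dropped stash
}

// assess returns the blast radius of the first matching pattern. When no
// pattern matches it reports BlastRadiusUnknown and matched == false.
func assess(command string) (radius BlastRadius, matched bool) {
	for _, p := range patterns {
		if p.re.MatchString(command) {
			return p.radius, true
		}
	}
	return BlastRadiusUnknown, false
}
```

With this sketch, `assess("rm -rf /tmp/build")` reports a high blast radius, while `assess("ls -la")` reports no match and an unknown blast radius.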

When safer mode's regex pass in assessDestructiveShellCommand returns no match, an optional Judge is now consulted before falling back to BlastRadiusUnknown. The judge is gated behind a small lexical-signal trigger (wipe / destroy / drop / purge / nuke / obliterate / erase / clobber / reset / annihilate), so a clean stream of inspection commands never pays an LLM round-trip; only commands that look possibly destructive but don't match the embedded pattern set escalate.

Behaviour:
- Pattern match -> deterministic safety wins, judge skipped.
- Pattern miss, no signal -> BlastRadiusUnknown fall-through, no LLM call.
- Pattern miss, signal -> 500 ms-cap LLM call; refined verdict wins.
- Judge timeout/error/nil -> BlastRadiusUnknown fall-through (fail-closed).

New files:
- pkg/tools/builtin/shell/judge.go: Judge interface, LexicalSignals, shouldConsultJudge gate, ProviderJudge default impl wrapping provider.Provider with a tight JSON-only system prompt and trailing-object verdict parser.
- pkg/tools/builtin/shell/judge_test.go: gate semantics, the five validator branches, and parseJudgeVerdict response-shape coverage (low downgrades to non-destructive, high and medium escalate, unknown / blank / unparseable fall through).

Modified:
- pkg/tools/builtin/shell/shell.go: added judge field on shellHandler, SetJudge wiring on ToolSet, and the consultation between the regex miss and the existing BlastRadiusUnknown return in ValidateShellToolCall.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
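The gate and the fail-closed fallback in this commit can be sketched as follows. This is a minimal illustration, not the code in judge.go: the `Verdict` shape, the `Classify` method name, the word-boundary matching, the `staticJudge` test double, and the use of `context.Background()` are all assumptions made to keep the example self-contained.

```go
package main

import (
	"context"
	"errors"
	"regexp"
	"time"
)

// Verdict is a hypothetical result shape for this sketch.
type Verdict struct {
	Destructive bool
	BlastRadius string // "low", "medium", "high" or "unknown"
	Reason      string
}

// Judge plays the role of the interface described in the commit; the
// method name and signature are assumptions.
type Judge interface {
	Classify(ctx context.Context, command string) (*Verdict, error)
}

// staticJudge is a test double that always returns the same answer.
type staticJudge struct {
	verdict *Verdict
	err     error
}

func (s staticJudge) Classify(context.Context, string) (*Verdict, error) {
	return s.verdict, s.err
}

var errJudgeUnavailable = errors.New("judge unavailable")

// lexicalSignals holds the trigger words listed in the commit message.
// Matching whole words, case-insensitively, is an assumption of this sketch.
var lexicalSignals = regexp.MustCompile(
	`(?i)\b(wipe|destroy|drop|purge|nuke|obliterate|erase|clobber|reset|annihilate)\b`)

// judgeTimeout is the 500 ms cap mentioned in the commit message.
const judgeTimeout = 500 * time.Millisecond

// shouldConsultJudge reports whether a pattern miss deserves an LLM call.
func shouldConsultJudge(j Judge, command string) bool {
	return j != nil && lexicalSignals.MatchString(command)
}

// residualVerdict handles a command that the regex pass did not match.
// Every failure mode falls back to an unknown blast radius, which at this
// commit still forces confirmation (fail-closed).
func residualVerdict(j Judge, command string) Verdict {
	unknown := Verdict{Destructive: true, BlastRadius: "unknown", Reason: "no pattern matched"}
	if !shouldConsultJudge(j, command) {
		return unknown
	}
	// A real implementation would derive this from the caller's context.
	ctx, cancel := context.WithTimeout(context.Background(), judgeTimeout)
	defer cancel()

	v, err := j.Classify(ctx, command)
	if err != nil || v == nil {
		return unknown
	}
	return *v
}
```

A judge that respects the context returns an error once the 500 ms deadline passes, so a slow provider takes the same fallback path as a failed one.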

… field

Plumbs the Judge interface added in the previous commit into the toolset config so users can opt in to the residual classifier from agent YAML.

When safer_judge_model is set on a shell toolset with safer:true, the runtime constructs a Provider from the "provider/model" string and calls ToolSet.SetJudge(NewProviderJudge(p)) during toolset load. Unset keeps the existing behaviour: pattern misses fall through to BlastRadiusUnknown without an LLM round-trip.

    toolsets:
      - type: shell
        safer: true
        safer_judge_model: anthropic/claude-haiku-4-5 # opt-in residual judge

Provider construction at wire-time is best-effort: a failure (unknown provider, missing credentials, etc.) is logged at WARN and the toolset proceeds without the judge. Safer mode still runs, just without the refined verdict on pattern misses.

Changes:
- pkg/config/latest/types.go: add SaferJudgeModel *string on Toolset
- pkg/config/latest/validate.go: reject when non-shell, no safer:true, or empty
- pkg/config/toolset_validate_test.go: five new cases
- agent-schema.json: sibling block for safer_judge_model
- pkg/tools/builtin/shell/shell.go: buildSaferJudge helper + CreateToolSet wiring
- examples/shell_safer.yaml: demonstrate the field

End-to-end flow reachable from the example:

    ANTHROPIC_API_KEY=... docker-agent run examples/shell_safer.yaml
    > "Drop my local database, run `bun run drop-db`"

→ safer-mode regex misses → lexical-signal trips → ProviderJudge classifies → confirmation dialog shows refined blast radius with reason "Safer-mode LLM judge: ...".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
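The three rejection rules for safer_judge_model, plus the "provider/model" split, might look roughly like this in Go. The trimmed `Toolset` struct, the function names, and the error messages are assumptions for illustration; the actual code in validate.go and shell.go is not shown on this page.

```go
package main

import (
	"errors"
	"strings"
)

// Toolset is a trimmed-down, hypothetical stand-in for the config type.
type Toolset struct {
	Type            string
	Safer           bool
	SaferJudgeModel *string
}

// validateSaferJudgeModel applies the three rejection rules from the
// commit message: non-shell toolset, missing safer:true, or empty value.
func validateSaferJudgeModel(t Toolset) error {
	if t.SaferJudgeModel == nil {
		return nil // unset keeps the existing behaviour
	}
	if t.Type != "shell" {
		return errors.New("safer_judge_model is only valid on shell toolsets")
	}
	if !t.Safer {
		return errors.New("safer_judge_model requires safer: true")
	}
	if strings.TrimSpace(*t.SaferJudgeModel) == "" {
		return errors.New("safer_judge_model must not be empty")
	}
	return nil
}

// splitProviderModel splits a "provider/model" reference at the first
// slash and reports false when either half is missing.
func splitProviderModel(ref string) (string, string, bool) {
	provider, model, found := strings.Cut(ref, "/")
	if !found || provider == "" || model == "" {
		return "", "", false
	}
	return provider, model, true
}
```

For the example config above, `splitProviderModel("anthropic/claude-haiku-4-5")` yields the provider `anthropic` and the model `claude-haiku-4-5`.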

Adds structured slog lines at three points so operators can confirm the LLM judge is wired and observe per-call decisions without attaching a debugger:

- Info on toolset construction: "residual judge wired" with the model name (paired with the existing Warn on construction failure).
- Info on every consultation: "consulting residual LLM judge" with the command being inspected.
- Info on each outcome:
  * "judge refined verdict" with blast_radius, destructive flag, and reason (the happy path).
  * "judge uncertain; falling back to Unknown" when the LLM returned (nil, nil), the soft "I don't know" path.
  * Warn "judge errored; falling back to Unknown" when the provider call failed or the 500 ms timeout fired (fail-closed path).

All logs use the "safer-shell:" prefix so they're greppable and never emit on a non-safer toolset or on commands the lexical gate skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
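The three outcome lines could be produced by a helper along these lines, using the standard log/slog package. The exact message strings, attribute keys other than blast_radius, and the `Verdict` shape are assumptions; only the "safer-shell:" prefix, the levels, and the quoted phrases come from the commit message.

```go
package main

import (
	"context"
	"errors"
	"log/slog"
)

// Verdict is a hypothetical result shape for this sketch.
type Verdict struct {
	Destructive bool
	BlastRadius string
	Reason      string
}

// errJudgeTimeout stands in for a provider failure or an expired deadline.
var errJudgeTimeout = errors.New("judge timed out")

const logPrefix = "safer-shell: "

// judgeLogEntry picks the level, message and attributes for one judge
// outcome: error, uncertain (nil verdict), or refined verdict.
func judgeLogEntry(command string, v *Verdict, err error) (slog.Level, string, []any) {
	switch {
	case err != nil:
		return slog.LevelWarn, logPrefix + "judge errored; falling back to Unknown",
			[]any{"command", command, "error", err}
	case v == nil:
		return slog.LevelInfo, logPrefix + "judge uncertain; falling back to Unknown",
			[]any{"command", command}
	default:
		return slog.LevelInfo, logPrefix + "judge refined verdict",
			[]any{
				"command", command,
				"blast_radius", v.BlastRadius,
				"destructive", v.Destructive,
				"reason", v.Reason,
			}
	}
}

// logJudgeOutcome emits the entry on the given logger.
func logJudgeOutcome(ctx context.Context, logger *slog.Logger, command string, v *Verdict, err error) {
	level, msg, attrs := judgeLogEntry(command, v, err)
	logger.Log(ctx, level, msg, attrs...)
}
```

Keeping the level and message selection in a pure function makes the three branches easy to test without capturing log output; `logJudgeOutcome(ctx, slog.Default(), cmd, v, err)` is then a one-line call at the consultation site.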

rumpl

This should be implemented as a hook, IMO; there is no need for the runtime to know that there are "safety" features.

…confirmation

Pivots safer mode from "paranoid-by-default" (every unmatched command warns with an Unknown blast radius) to "precision-by-default" (gate only when there is evidence of destructive intent: a pattern match OR an LLM-judge verdict).

The validator now returns nil for commands that match no destructive pattern and either have no destructive lexical signal or the residual LLM judge could not classify them.

Behaviour matrix after this change:

    Pattern match (low/medium/high)          -> gate (unchanged)
    Pattern miss + no lexical signal         -> pass through (was: forced Unknown)
    Pattern miss + signal + judge HIGH/MED   -> gate (unchanged)
    Pattern miss + signal + judge LOW        -> pass through (unchanged; judge sets Destructive:false)
    Pattern miss + signal + judge uncertain  -> pass through (was: forced Unknown)
    Pattern miss + signal + judge errors     -> pass through (was: forced Unknown)

Rationale: the goal of safer mode is to stop irreversible destructive actions, not to interrupt the agent on every uncategorised shell call. Inspection ops (pwd, ls, cat, find, grep, git status, docker ps / logs / inspect, build commands, ...) now flow through silently. Commands the deterministic catalogue or the LLM judge tag as destructive still force the confirmation dialog.

Trade-off: commands the regex set and the LLM judge both miss now execute without confirmation. This is the explicit cost of erring toward workflow continuity; the seed pattern catalogue and the lexical-signal-gated judge are the two compensating mitigations. The docs and PR body note "unmatched commands still warn"; that line is now stale and should be revised when this lands.

Tests:
- judge_test.go: rename three sub-tests and assert nil on the pass-through paths (no-signal, judge-uncertain, judge-error).
- safer_test.go: rename TestValidateShellToolCallSaferWarnsForUnmatchedCommand to ...PassesThroughUnmatchedCommand and assert nil on `ls -la`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
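The behaviour matrix in this commit reduces to a small decision function. The Go sketch below encodes it directly; the `JudgeOutcome` enum and the `mustConfirm` name are hypothetical and exist only to make the six rows checkable.

```go
package main

// JudgeOutcome enumerates what the residual judge can report in this
// sketch; the names are hypothetical.
type JudgeOutcome int

const (
	JudgeNotConsulted JudgeOutcome = iota
	JudgeHigh
	JudgeMedium
	JudgeLow
	JudgeUncertain
	JudgeError
)

// mustConfirm reports whether the confirmation dialog is forced, following
// the behaviour matrix after the pivot to "precision-by-default".
func mustConfirm(patternMatched, lexicalSignal bool, judge JudgeOutcome) bool {
	if patternMatched {
		return true // deterministic catalogue always gates
	}
	if !lexicalSignal {
		return false // no evidence of destructive intent: pass through
	}
	switch judge {
	case JudgeHigh, JudgeMedium:
		return true // judge found a destructive command
	default:
		return false // low, uncertain, error, or no judge: pass through
	}
}
```

The `default` branch is where the trade-off described above lives: a judge error or an uncertain answer no longer forces confirmation.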

aheritier added labels Jun 24, 2026: area/tools (for features, issues, and fixes related to the usage of built-in and MCP tools) and kind/feat (PR adds a new feature; maps to the feat: commit prefix)

melmennaoui mentioned this pull request Jun 24, 2026

feat(safer-shell): add residual LLM judge for unknown-pattern path #3220

Closed

4 tasks

melmennaoui and others added 2 commits June 24, 2026 10:41

rumpl requested changes Jun 24, 2026

dgageot wants to merge 5 commits into docker:main from dgageot:safer-shell
