feat: add safer mode to shell toolset for destructive command detection #3216
Draft
dgageot wants to merge 5 commits into
Conversation
When safer: true is set on a shell toolset, every shell command is checked against an embedded safety-pattern taxonomy before the normal approval flow. Matched destructive commands surface a blast-radius level in the confirmation UI; unmatched commands still warn with an unknown blast radius. The forced confirmation cannot be bypassed by --yolo or permission allow rules.
Assisted-By: Claude
When safer mode's regex pass in assessDestructiveShellCommand returns
no match, an optional Judge is now consulted before falling back to
BlastRadiusUnknown. The judge is gated behind a small lexical-signal
trigger (wipe / destroy / drop / purge / nuke / obliterate / erase /
clobber / reset / annihilate) so a clean stream of inspection
commands never pays an LLM round-trip — only commands that look
possibly-destructive but don't match the embedded pattern set
escalate.
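The gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in judge.go: the word list is copied from the commit message, while the function name `hasLexicalSignal` and the choice of case-insensitive substring matching are assumptions.

```go
package main

import "strings"

// lexicalSignals copies the trigger words listed in the commit message.
var lexicalSignals = []string{
	"wipe", "destroy", "drop", "purge", "nuke",
	"obliterate", "erase", "clobber", "reset", "annihilate",
}

// hasLexicalSignal reports whether cmd contains any trigger word.
// Assumption: case-insensitive substring matching; the real gate may
// tokenize the command instead.
func hasLexicalSignal(cmd string) bool {
	lower := strings.ToLower(cmd)
	for _, signal := range lexicalSignals {
		if strings.Contains(lower, signal) {
			return true
		}
	}
	return false
}
```

With substring matching, a command such as `bun run drop-db` trips the gate while `ls -la` does not. The cost is false positives on words like "dropbox", which only matter in that they trigger an extra judge call.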
Behaviour:
- Pattern match -> deterministic safety wins, judge skipped.
- Pattern miss, no signal -> BlastRadiusUnknown fall-through, no LLM call.
- Pattern miss, signal -> 500 ms-cap LLM call; refined verdict wins.
- Judge timeout/error/nil -> BlastRadiusUnknown fall-through (fail-closed).
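The timeout and fall-through behaviour in this commit might look like the sketch below. The `Judge` interface shape, the `Verdict` type, and every helper name here are hypothetical; only the 500 ms cap and the "timeout, error, or nil falls back to Unknown" rule come from the commit message.

```go
package main

import (
	"context"
	"time"
)

// Verdict is a hypothetical stand-in for the PR's assessment type.
type Verdict struct {
	BlastRadius string // "low", "medium", "high", or "unknown"
	Destructive bool
	Reason      string
}

// Judge is a guess at the interface shape; the PR does not show it.
type Judge interface {
	Assess(ctx context.Context, cmd string) (*Verdict, error)
}

// judgeTimeout is the 500 ms cap named in the commit message.
const judgeTimeout = 500 * time.Millisecond

// unknownVerdict models the BlastRadiusUnknown fall-through.
func unknownVerdict() *Verdict {
	return &Verdict{BlastRadius: "unknown", Destructive: true, Reason: "no pattern matched"}
}

// consultJudge asks the judge under a deadline. A nil judge, an error
// (including a timeout), or a nil verdict all fall back to Unknown.
// It relies on the judge honouring context cancellation.
func consultJudge(ctx context.Context, j Judge, cmd string) *Verdict {
	if j == nil {
		return unknownVerdict()
	}
	ctx, cancel := context.WithTimeout(ctx, judgeTimeout)
	defer cancel()
	v, err := j.Assess(ctx, cmd)
	if err != nil || v == nil {
		return unknownVerdict()
	}
	return v
}

// hangingJudge stands in for a slow provider: it only returns once
// its context is cancelled.
type hangingJudge struct{}

func (hangingJudge) Assess(ctx context.Context, _ string) (*Verdict, error) {
	<-ctx.Done()
	return nil, ctx.Err()
}

// fixedJudge always returns the same verdict (possibly nil).
type fixedJudge struct{ verdict *Verdict }

func (f fixedJudge) Assess(context.Context, string) (*Verdict, error) {
	return f.verdict, nil
}

// assessOnce is a convenience wrapper for callers without a context.
func assessOnce(j Judge, cmd string) *Verdict {
	return consultJudge(context.Background(), j, cmd)
}
```

Collapsing every failure mode into the same Unknown verdict keeps the caller's logic to a single branch, which is what makes the path fail-closed at this point in the PR.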
New files:
- pkg/tools/builtin/shell/judge.go
  Judge interface, LexicalSignals, shouldConsultJudge gate,
  ProviderJudge default impl wrapping provider.Provider with a
  tight JSON-only system prompt and trailing-object verdict parser.
- pkg/tools/builtin/shell/judge_test.go
  Gate semantics, the five validator branches, and parseJudgeVerdict
  response-shape coverage (low downgrades to non-destructive, high
  and medium escalate, unknown / blank / unparseable fall through).
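A trailing-object verdict parser with the response shapes listed above could be sketched like this. The JSON field names `blast_radius` and `reason`, the `Verdict` type, and the function name `parseVerdict` are assumptions; the PR names parseJudgeVerdict but does not show its body or the response schema.

```go
package main

import (
	"encoding/json"
	"strings"
)

// Verdict is a hypothetical stand-in for the PR's assessment type.
type Verdict struct {
	BlastRadius string
	Destructive bool
	Reason      string
}

// rawVerdict is the assumed JSON shape of the model's answer.
type rawVerdict struct {
	BlastRadius string `json:"blast_radius"`
	Reason      string `json:"reason"`
}

// parseVerdict extracts the last flat {...} object from resp, so any
// chatter the model emits before the JSON is ignored. It returns nil
// when the answer is blank, unparseable, or not low/medium/high.
// Assumption: the object is flat and contains no braces in its strings.
func parseVerdict(resp string) *Verdict {
	start := strings.LastIndex(resp, "{")
	end := strings.LastIndex(resp, "}")
	if start < 0 || end < start {
		return nil
	}
	var raw rawVerdict
	if err := json.Unmarshal([]byte(resp[start:end+1]), &raw); err != nil {
		return nil
	}
	switch strings.ToLower(strings.TrimSpace(raw.BlastRadius)) {
	case "high":
		return &Verdict{BlastRadius: "high", Destructive: true, Reason: raw.Reason}
	case "medium":
		return &Verdict{BlastRadius: "medium", Destructive: true, Reason: raw.Reason}
	case "low":
		// Low downgrades to non-destructive, per the commit message.
		return &Verdict{BlastRadius: "low", Destructive: false, Reason: raw.Reason}
	default:
		return nil
	}
}
```

Returning nil for everything outside the three known levels lets the caller treat "unknown", blank, and malformed answers identically.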
Modified:
- pkg/tools/builtin/shell/shell.go
  Added judge field on shellHandler, SetJudge wiring on ToolSet,
  and the consultation between the regex miss and the existing
  BlastRadiusUnknown return in ValidateShellToolCall.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… field
Plumbs the Judge interface added in the previous commit into the toolset
config so users can opt in to the residual classifier from agent YAML.
When safer_judge_model is set on a shell toolset with safer:true, the
runtime constructs a Provider from the "provider/model" string and
calls ToolSet.SetJudge(NewProviderJudge(p)) during toolset load. Unset
keeps the existing behaviour: pattern misses fall through to
BlastRadiusUnknown without an LLM round-trip.
    toolsets:
      - type: shell
        safer: true
        safer_judge_model: anthropic/claude-haiku-4-5  # opt-in residual judge
Provider construction at wire-time is best-effort: a failure (unknown
provider, missing credentials, etc.) is logged at WARN and the toolset
proceeds without the judge — safer mode still runs, just without the
refined verdict on pattern misses.
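The best-effort wiring described above might be sketched as follows. Everything here is illustrative: `splitModelRef`, `buildJudgeBestEffort`, the one-method `Judge` interface, and `stubJudge` are hypothetical names, and the factory parameter stands in for the real provider construction, which the PR does not show.

```go
package main

import (
	"fmt"
	"log/slog"
	"strings"
)

// Judge is a placeholder interface for this sketch only.
type Judge interface{ Model() string }

// stubJudge is a trivial Judge used to exercise the wiring.
type stubJudge struct{ model string }

func (s stubJudge) Model() string { return s.model }

// splitModelRef splits "provider/model" at the first slash, so model
// names that themselves contain slashes stay intact.
func splitModelRef(ref string) (string, string, error) {
	provider, model, ok := strings.Cut(strings.TrimSpace(ref), "/")
	if !ok || provider == "" || model == "" {
		return "", "", fmt.Errorf("invalid model reference %q: want \"provider/model\"", ref)
	}
	return provider, model, nil
}

// buildJudgeBestEffort returns a Judge, or nil when construction fails.
// Failures are logged at WARN and never abort toolset loading.
func buildJudgeBestEffort(ref string, newJudge func(provider, model string) (Judge, error)) Judge {
	provider, model, err := splitModelRef(ref)
	if err != nil {
		slog.Warn("safer-shell: judge disabled", "model", ref, "error", err)
		return nil
	}
	j, err := newJudge(provider, model)
	if err != nil {
		slog.Warn("safer-shell: judge disabled", "model", ref, "error", err)
		return nil
	}
	return j
}
```

A nil return here pairs naturally with a nil check at consultation time: the toolset keeps running in safer mode and simply skips the judge.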
Changes:
- pkg/config/latest/types.go           add SaferJudgeModel *string on Toolset
- pkg/config/latest/validate.go        reject when non-shell, no safer:true, or empty
- pkg/config/toolset_validate_test.go  five new cases
- agent-schema.json                    sibling block for safer_judge_model
- pkg/tools/builtin/shell/shell.go     buildSaferJudge helper + CreateToolSet wiring
- examples/shell_safer.yaml            demonstrate the field
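The validation rule listed for validate.go could look something like this sketch. Only the `SaferJudgeModel *string` field is taken from the commit message; the `Type` and `Safer` field names, the function name, and the error texts are assumptions.

```go
package main

import (
	"errors"
	"strings"
)

// Toolset is a trimmed-down, hypothetical view of the config type.
type Toolset struct {
	Type            string
	Safer           bool
	SaferJudgeModel *string // nil means the field was not set
}

// validateSaferJudgeModel rejects safer_judge_model when it is set on
// a non-shell toolset, set without safer: true, or set to an empty
// string. An unset field is always valid.
func validateSaferJudgeModel(t Toolset) error {
	if t.SaferJudgeModel == nil {
		return nil
	}
	switch {
	case t.Type != "shell":
		return errors.New("safer_judge_model is only valid on shell toolsets")
	case !t.Safer:
		return errors.New("safer_judge_model requires safer: true")
	case strings.TrimSpace(*t.SaferJudgeModel) == "":
		return errors.New("safer_judge_model must not be empty")
	}
	return nil
}
```

Using a pointer for the field is what lets the validator distinguish "unset" from "set to an empty string".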
End-to-end flow reachable from the example:
    ANTHROPIC_API_KEY=... docker-agent run examples/shell_safer.yaml
    > "Drop my local database, run `bun run drop-db`"
    → safer-mode regex misses → lexical-signal trips → ProviderJudge
      classifies → confirmation dialog shows refined blast radius with
      reason "Safer-mode LLM judge: ...".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds structured slog lines at three points so operators can confirm
the LLM judge is wired and observe per-call decisions without
attaching a debugger:
- Info on toolset construction: "residual judge wired" with the
  model name (paired with the existing Warn on construction
  failure).
- Info on every consultation: "consulting residual LLM judge" with
  the command being inspected.
- Info on each outcome:
  * "judge refined verdict" with blast_radius, destructive flag,
    and reason (the happy path).
  * "judge uncertain; falling back to Unknown" when the LLM
    returned (nil, nil) — the soft "I don't know" path.
  * Warn "judge errored; falling back to Unknown" when the
    provider call failed or the 500 ms timeout fired
    (fail-closed path).
All logs use the "safer-shell:" prefix so they're greppable. They are
never emitted on a non-safer toolset or for commands the lexical gate
skipped.
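The outcome logging might be structured along these lines. The message texts and the "safer-shell:" prefix come from the commit message; how the prefix is joined to the message, the attribute keys other than blast_radius, and the names `outcomeMessage`, `logOutcome`, and `errJudgeTimeout` are assumptions.

```go
package main

import (
	"errors"
	"log/slog"
)

// Verdict is a hypothetical stand-in for the PR's assessment type.
type Verdict struct {
	BlastRadius string
	Destructive bool
	Reason      string
}

// errJudgeTimeout is a hypothetical sentinel for a judge timeout.
var errJudgeTimeout = errors.New("judge timed out")

// outcomeMessage picks the log line for a judge result and reports
// whether it should be logged at WARN instead of INFO.
func outcomeMessage(v *Verdict, err error) (msg string, warn bool) {
	switch {
	case err != nil:
		return "safer-shell: judge errored; falling back to Unknown", true
	case v == nil:
		return "safer-shell: judge uncertain; falling back to Unknown", false
	default:
		return "safer-shell: judge refined verdict", false
	}
}

// logOutcome emits one structured line per judge consultation.
func logOutcome(logger *slog.Logger, cmd string, v *Verdict, err error) {
	msg, warn := outcomeMessage(v, err)
	switch {
	case warn:
		logger.Warn(msg, "command", cmd, "error", err)
	case v == nil:
		logger.Info(msg, "command", cmd)
	default:
		logger.Info(msg, "command", cmd,
			"blast_radius", v.BlastRadius,
			"destructive", v.Destructive,
			"reason", v.Reason)
	}
}
```

Keeping the message selection in a pure function makes the three outcomes easy to cover in tests without capturing log output.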
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rumpl (Member) requested changes on Jun 24, 2026 and left a comment:
This should be implemented as a hook IMO; there is no need for the runtime to know that there are "safety" features.
…confirmation
Pivots safer mode from "paranoid-by-default" (every unmatched command
warns with an Unknown blast radius) to "precision-by-default" (gate
only when there is evidence of destructive intent: a pattern match OR
an LLM-judge verdict). The validator now returns nil for commands
that match no destructive pattern and that either carry no destructive
lexical signal or could not be classified by the residual LLM judge.
Behaviour matrix after this change:
    Pattern match (low/medium/high)          -> gate (unchanged)
    Pattern miss + no lexical signal         -> pass through (was: forced Unknown)
    Pattern miss + signal + judge HIGH/MED   -> gate (unchanged)
    Pattern miss + signal + judge LOW        -> pass through (unchanged; judge sets Destructive:false)
    Pattern miss + signal + judge uncertain  -> pass through (was: forced Unknown)
    Pattern miss + signal + judge errors     -> pass through (was: forced Unknown)
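The matrix above reduces to a small decision function. The sketch below is only a restatement of the six rows; the `JudgeOutcome` enum and the `shouldGate` name are hypothetical and do not appear in the PR.

```go
package main

// JudgeOutcome is a hypothetical enum covering the judge columns of
// the behaviour matrix.
type JudgeOutcome int

const (
	JudgeNotConsulted JudgeOutcome = iota
	JudgeHigh
	JudgeMedium
	JudgeLow
	JudgeUncertain
	JudgeErrored
)

// shouldGate reports whether the confirmation dialog is forced.
// A pattern match always gates. Otherwise only a HIGH or MEDIUM
// judge verdict on a command with a lexical signal gates.
func shouldGate(patternMatched, lexicalSignal bool, judge JudgeOutcome) bool {
	if patternMatched {
		return true
	}
	if !lexicalSignal {
		return false
	}
	return judge == JudgeHigh || judge == JudgeMedium
}
```

Written this way, the trade-off stated in the commit is visible in the last line: anything the judge cannot positively classify as HIGH or MEDIUM passes through.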
Rationale: the goal of safer mode is to stop irreversible destructive
actions, not to interrupt the agent on every uncategorised shell call.
Inspection ops (pwd, ls, cat, find, grep, git status, docker ps /
logs / inspect, build commands, ...) now flow through silently.
Commands the deterministic catalogue or the LLM judge tag as
destructive still force the confirmation dialog.
Trade-off: commands the regex set and the LLM judge both miss now
execute without confirmation. This is the explicit cost of erring
toward workflow continuity; the seed pattern catalogue and the
lexical-signal-gated judge are the two compensating mitigations. The
docs and PR body note "unmatched commands still warn" — that line is
now stale and should be revised when this lands.
Tests:
- judge_test.go: rename three sub-tests and assert nil on the
pass-through paths (no-signal, judge-uncertain, judge-error).
- safer_test.go: rename TestValidateShellToolCallSaferWarnsForUnmatchedCommand
to ...PassesThroughUnmatchedCommand and assert nil on `ls -la`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>