feat: add safer mode to shell toolset for destructive command detection #3216

Draft
dgageot wants to merge 5 commits into docker:main from dgageot:safer-shell

@dgageot dgageot commented Jun 24, 2026

When safer: true is set on a shell toolset, every shell command is checked against an embedded safety-pattern taxonomy before the normal approval flow. Matched destructive commands surface a blast-radius level in the confirmation UI; unmatched commands still warn with an unknown blast radius. The forced confirmation cannot be bypassed by --yolo or permission allow rules.

Assisted-By: Claude

@aheritier added the labels area/tools (for features/issues/fixes related to the usage of built-in and MCP tools) and kind/feat (PR adds a new feature; maps to the feat: commit prefix) on Jun 24, 2026

When safer mode's regex pass in assessDestructiveShellCommand returns
no match, an optional Judge is now consulted before falling back to
BlastRadiusUnknown. The judge is gated behind a small lexical-signal
trigger (wipe / destroy / drop / purge / nuke / obliterate / erase /
clobber / reset / annihilate) so a clean stream of inspection
commands never pays an LLM round-trip — only commands that look
possibly-destructive but don't match the embedded pattern set
escalate.
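
A minimal sketch of the lexical-signal gate described above. The word list comes from this commit message; the function body, the case-insensitive match, and the word-boundary matching are assumptions made for illustration and are not taken from judge.go.

```go
package main

import (
	"fmt"
	"regexp"
)

// lexicalSignals holds the trigger words named in the commit message.
// Case-insensitive, word-boundary matching is an assumption of this sketch.
var lexicalSignals = regexp.MustCompile(
	`(?i)\b(wipe|destroy|drop|purge|nuke|obliterate|erase|clobber|reset|annihilate)\b`)

// shouldConsultJudge reports whether a command that missed every pattern
// still looks destructive enough to justify an LLM round-trip.
func shouldConsultJudge(command string) bool {
	return lexicalSignals.MatchString(command)
}

func main() {
	for _, cmd := range []string{"ls -la", "git status", "bun run drop-db", "make nuke-cache"} {
		fmt.Printf("%-20q consult judge: %v\n", cmd, shouldConsultJudge(cmd))
	}
}
```

With word boundaries, `drop-db` trips the gate while an unrelated token such as `dropdown` does not. A plain substring check would be simpler but would escalate more often.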

Behaviour:

  - Pattern match           -> deterministic safety wins, judge skipped.
  - Pattern miss, no signal -> BlastRadiusUnknown fall-through, no LLM call.
  - Pattern miss, signal    -> 500 ms-cap LLM call; refined verdict wins.
  - Judge timeout/error/nil -> BlastRadiusUnknown fall-through (fail-closed).

New files:

  - pkg/tools/builtin/shell/judge.go
      Judge interface, LexicalSignals, shouldConsultJudge gate,
      ProviderJudge default impl wrapping provider.Provider with a
      tight JSON-only system prompt and trailing-object verdict parser.

  - pkg/tools/builtin/shell/judge_test.go
      Gate semantics, the five validator branches, and parseJudgeVerdict
      response-shape coverage (low downgrades to non-destructive, high
      and medium escalate, unknown / blank / unparseable fall through).
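
For illustration, a trailing-object verdict parser with the response-shape behaviour listed above might look like the following. The JSON field names (`blast_radius`, `reason`), the `Verdict` type, and the "last flat object in the response" heuristic are assumptions of this sketch, not the parser shipped in judge.go.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Verdict is a hypothetical result type used only in this sketch.
type Verdict struct {
	Destructive bool
	BlastRadius string
	Reason      string
}

// parseJudgeVerdict extracts the last flat JSON object from an LLM response.
// It returns nil when there is no usable verdict, so the caller falls through.
func parseJudgeVerdict(response string) *Verdict {
	start := strings.LastIndex(response, "{")
	end := strings.LastIndex(response, "}")
	if start < 0 || end < start {
		return nil
	}
	var raw struct {
		BlastRadius string `json:"blast_radius"`
		Reason      string `json:"reason"`
	}
	if err := json.Unmarshal([]byte(response[start:end+1]), &raw); err != nil {
		return nil // unparseable: fall through
	}
	switch level := strings.ToLower(strings.TrimSpace(raw.BlastRadius)); level {
	case "high", "medium":
		return &Verdict{Destructive: true, BlastRadius: level, Reason: raw.Reason}
	case "low":
		return &Verdict{Destructive: false, BlastRadius: level, Reason: raw.Reason}
	default:
		return nil // unknown or blank: fall through
	}
}

func main() {
	responses := []string{
		`Sure. {"blast_radius":"high","reason":"drops the local database"}`,
		`{"blast_radius":"low","reason":"resets a cache"}`,
		`{"blast_radius":"unknown"}`,
		`I cannot tell.`,
	}
	for _, r := range responses {
		fmt.Printf("%+v\n", parseJudgeVerdict(r))
	}
}
```

Because the sketch looks for the last `{`, a reason string that itself contains braces fails to parse and falls through, which is the conservative outcome.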

Modified:

  - pkg/tools/builtin/shell/shell.go
      Added judge field on shellHandler, SetJudge wiring on ToolSet,
      and the consultation between the regex miss and the existing
      BlastRadiusUnknown return in ValidateShellToolCall.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
melmennaoui and others added 2 commits June 24, 2026 10:41
… field

Plumbs the Judge interface added in the previous commit into the toolset
config so users can opt in to the residual classifier from agent YAML.
When safer_judge_model is set on a shell toolset with safer:true, the
runtime constructs a Provider from the "provider/model" string and
calls ToolSet.SetJudge(NewProviderJudge(p)) during toolset load. Unset
keeps the existing behaviour: pattern misses fall through to
BlastRadiusUnknown without an LLM round-trip.

  toolsets:
    - type: shell
      safer: true
      safer_judge_model: anthropic/claude-haiku-4-5   # opt-in residual judge

Provider construction at wire-time is best-effort: a failure (unknown
provider, missing credentials, etc.) is logged at WARN and the toolset
proceeds without the judge — safer mode still runs, just without the
refined verdict on pattern misses.
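
The best-effort wiring described above could be sketched roughly as follows. `newProvider`, `stubProvider`, and the `Judge` struct are hypothetical stand-ins for the real provider package; only the "provider/model" string format, the WARN on failure, and the "proceed without the judge" behaviour come from this commit message.

```go
package main

import (
	"fmt"
	"log/slog"
	"strings"
)

// Provider and stubProvider are stand-ins for the real provider package.
type Provider interface{ Name() string }

type stubProvider struct{ name string }

func (p stubProvider) Name() string { return p.name }

// newProvider is a hypothetical factory that only knows "anthropic".
func newProvider(ref string) (Provider, error) {
	providerName, model, ok := strings.Cut(ref, "/")
	if !ok || providerName == "" || model == "" {
		return nil, fmt.Errorf("invalid model reference %q, want provider/model", ref)
	}
	if providerName != "anthropic" {
		return nil, fmt.Errorf("unknown provider %q", providerName)
	}
	return stubProvider{name: ref}, nil
}

// Judge is a placeholder for the judge built around a provider.
type Judge struct{ provider Provider }

// buildSaferJudge returns nil when the field is unset or construction fails,
// so the toolset keeps running in safer mode without the refined verdict.
func buildSaferJudge(model *string) *Judge {
	if model == nil {
		return nil
	}
	p, err := newProvider(*model)
	if err != nil {
		slog.Warn("safer-shell: judge construction failed; continuing without judge",
			"model", *model, "error", err)
		return nil
	}
	return &Judge{provider: p}
}

func main() {
	good, bad := "anthropic/claude-haiku-4-5", "acme/unknown-model"
	fmt.Println(buildSaferJudge(&good) != nil)
	fmt.Println(buildSaferJudge(&bad) != nil)
	fmt.Println(buildSaferJudge(nil) != nil)
}
```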

Changes:
  - pkg/config/latest/types.go           add SaferJudgeModel *string on Toolset
  - pkg/config/latest/validate.go        reject when non-shell, no safer:true, or empty
  - pkg/config/toolset_validate_test.go  five new cases
  - agent-schema.json                    sibling block for safer_judge_model
  - pkg/tools/builtin/shell/shell.go     buildSaferJudge helper + CreateToolSet wiring
  - examples/shell_safer.yaml            demonstrate the field
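
As a rough illustration of the three rejection rules listed for validate.go, assuming a pared-down `Toolset` struct and hypothetical error messages (the real types and wording live in pkg/config/latest and may differ):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Toolset is a pared-down stand-in for the real config type.
type Toolset struct {
	Type            string
	Safer           bool
	SaferJudgeModel *string
}

// validateSaferJudgeModel rejects the field when it is set on a non-shell
// toolset, set without safer: true, or set to an empty string.
func validateSaferJudgeModel(t Toolset) error {
	if t.SaferJudgeModel == nil {
		return nil // unset: nothing to validate
	}
	switch {
	case t.Type != "shell":
		return errors.New("safer_judge_model is only valid on shell toolsets")
	case !t.Safer:
		return errors.New("safer_judge_model requires safer: true")
	case strings.TrimSpace(*t.SaferJudgeModel) == "":
		return errors.New("safer_judge_model must not be empty")
	}
	return nil
}

func main() {
	model, empty := "anthropic/claude-haiku-4-5", ""
	cases := []Toolset{
		{Type: "shell", Safer: true, SaferJudgeModel: &model},
		{Type: "filesystem", Safer: true, SaferJudgeModel: &model},
		{Type: "shell", Safer: false, SaferJudgeModel: &model},
		{Type: "shell", Safer: true, SaferJudgeModel: &empty},
	}
	for _, c := range cases {
		fmt.Printf("type=%s safer=%v -> %v\n", c.Type, c.Safer, validateSaferJudgeModel(c))
	}
}
```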

End-to-end flow reachable from the example:

  ANTHROPIC_API_KEY=... docker-agent run examples/shell_safer.yaml
  > "Drop my local database, run `bun run drop-db`"
  → safer-mode regex misses → lexical-signal trips → ProviderJudge
    classifies → confirmation dialog shows refined blast radius with
    reason "Safer-mode LLM judge: ...".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds structured slog lines at three points so operators can confirm
the LLM judge is wired and observe per-call decisions without
attaching a debugger:

  - Info on toolset construction: "residual judge wired" with the
    model name (paired with the existing Warn on construction
    failure).
  - Info on every consultation: "consulting residual LLM judge" with
    the command being inspected.
  - Info on each outcome:
      * "judge refined verdict" with blast_radius, destructive flag,
        and reason (the happy path).
      * "judge uncertain; falling back to Unknown" when the LLM
        returned (nil, nil) — the soft "I don't know" path.
      * Warn "judge errored; falling back to Unknown" when the
        provider call failed or the 500 ms timeout fired
        (fail-closed path).

All logs use the "safer-shell:" prefix so they're greppable. None are
emitted on a non-safer toolset or for commands the lexical gate skipped.
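
A sketch of how the per-outcome log lines could be produced with log/slog. The message strings and attribute names follow this commit message; the `Verdict` type, the `outcomeLog` and `logOutcome` helpers, and placing the "safer-shell:" prefix inside the message are assumptions.

```go
package main

import (
	"context"
	"errors"
	"log/slog"
	"os"
)

// Verdict is a hypothetical result type used only in this sketch.
type Verdict struct {
	Destructive bool
	BlastRadius string
	Reason      string
}

var errJudgeTimeout = errors.New("judge timed out")

// outcomeLog maps a judge result to a level, message, and attributes.
func outcomeLog(v *Verdict, err error) (slog.Level, string, []any) {
	switch {
	case err != nil:
		return slog.LevelWarn, "safer-shell: judge errored; falling back to Unknown",
			[]any{"error", err}
	case v == nil:
		return slog.LevelInfo, "safer-shell: judge uncertain; falling back to Unknown", nil
	default:
		return slog.LevelInfo, "safer-shell: judge refined verdict", []any{
			"blast_radius", v.BlastRadius, "destructive", v.Destructive, "reason", v.Reason}
	}
}

// logOutcome emits exactly one structured line per judge consultation.
func logOutcome(logger *slog.Logger, v *Verdict, err error) {
	level, msg, attrs := outcomeLog(v, err)
	logger.Log(context.Background(), level, msg, attrs...)
}

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stdout, nil))
	logger.Info("safer-shell: consulting residual LLM judge", "command", "bun run drop-db")
	logOutcome(logger, &Verdict{Destructive: true, BlastRadius: "high", Reason: "drops db"}, nil)
	logOutcome(logger, nil, nil)
	logOutcome(logger, nil, errJudgeTimeout)
}
```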

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@rumpl rumpl left a comment

This should be implemented as a hook IMO, no need for the runtime to know that there are "safety" features

…confirmation

Pivots safer mode from "paranoid-by-default" (every unmatched command
warns with an Unknown blast radius) to "precision-by-default" (gate
only when there is evidence of destructive intent: a pattern match OR
an LLM-judge verdict). The validator now returns nil for commands
that match no destructive pattern and either have no destructive
lexical signal or the residual LLM judge could not classify them.

Behaviour matrix after this change:

  Pattern match (low/medium/high)         -> gate (unchanged)
  Pattern miss + no lexical signal        -> pass through (was: forced Unknown)
  Pattern miss + signal + judge HIGH/MED  -> gate (unchanged)
  Pattern miss + signal + judge LOW       -> pass through (unchanged; judge sets Destructive:false)
  Pattern miss + signal + judge uncertain -> pass through (was: forced Unknown)
  Pattern miss + signal + judge errors    -> pass through (was: forced Unknown)
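
The matrix above reduces to a small predicate. This is a sketch under assumptions: `Verdict`, `shouldGate`, and the way the judge outcome is passed in are hypothetical, and the real validator returns a result or nil rather than a bool.

```go
package main

import (
	"errors"
	"fmt"
)

// Verdict is a hypothetical result type used only in this sketch.
type Verdict struct {
	Destructive bool
	BlastRadius string
}

var errJudge = errors.New("judge unavailable")

// shouldGate reports whether the confirmation dialog is forced.
// matched is the regex-pass result; judged and judgeErr are the judge outcome.
func shouldGate(matched *Verdict, signal bool, judged *Verdict, judgeErr error) bool {
	if matched != nil {
		return true // pattern match (low/medium/high) always gates
	}
	if !signal || judgeErr != nil || judged == nil {
		return false // no signal, judge error, or judge uncertain: pass through
	}
	return judged.Destructive // HIGH/MED gate, LOW passes through
}

func main() {
	high := &Verdict{Destructive: true, BlastRadius: "high"}
	low := &Verdict{Destructive: false, BlastRadius: "low"}
	fmt.Println("pattern match:        ", shouldGate(low, false, nil, nil))
	fmt.Println("miss, no signal:      ", shouldGate(nil, false, nil, nil))
	fmt.Println("miss, judge HIGH:     ", shouldGate(nil, true, high, nil))
	fmt.Println("miss, judge LOW:      ", shouldGate(nil, true, low, nil))
	fmt.Println("miss, judge uncertain:", shouldGate(nil, true, nil, nil))
	fmt.Println("miss, judge errors:   ", shouldGate(nil, true, nil, errJudge))
}
```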

Rationale: the goal of safer mode is to stop irreversible destructive
actions, not to interrupt the agent on every uncategorised shell call.
Inspection ops (pwd, ls, cat, find, grep, git status, docker ps /
logs / inspect, build commands, ...) now flow through silently.
Commands the deterministic catalogue or the LLM judge tag as
destructive still force the confirmation dialog.

Trade-off: commands the regex set and the LLM judge both miss now
execute without confirmation. This is the explicit cost of erring
toward workflow continuity; the seed pattern catalogue and the
lexical-signal-gated judge are the two compensating mitigations. The
docs and PR body note "unmatched commands still warn" — that line is
now stale and should be revised when this lands.

Tests:
  - judge_test.go: rename three sub-tests and assert nil on the
    pass-through paths (no-signal, judge-uncertain, judge-error).
  - safer_test.go: rename TestValidateShellToolCallSaferWarnsForUnmatchedCommand
    to ...PassesThroughUnmatchedCommand and assert nil on `ls -la`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


4 participants