Skip to content

fix(security): reduce guard false positives and add validation coverage#149

Open
MicroMilo wants to merge 2 commits into
OpenBMB:mainfrom
MicroMilo:feat/ai-sec
Open

fix(security): reduce guard false positives and add validation coverage#149
MicroMilo wants to merge 2 commits into
OpenBMB:mainfrom
MicroMilo:feat/ai-sec

Conversation

@MicroMilo
Copy link
Copy Markdown

Summary

This PR tightens PilotDeck Security Guard validation and fixes a false-positive pattern bug in the default security policy.

The main code fix is small but important: the default suspicious pattern for shell piping now escapes the regex pipe character correctly. Previously, | bash could be interpreted as a regex alternation with an empty branch, which risked matching benign content too broadly.

This PR also adds regression coverage for the Security Guard behavior and includes live AB validation evidence from real model runs.

Changes

  • Fix default suspicious pattern from | bash to \\| bash.
  • Add deterministic Security Guard regression tests for:
    • MCP instruction injection
    • MCP tool output injection
    • bash/tool output credential leakage warning
    • web_fetch prompt injection boundary
    • MCP annotation forgery / unsafe mutation signals
    • benign sanity cases to guard against UX-impacting false positives
  • Add live AB validation summary for DeepSeek, GLM, and Kimi model runs.

Why This Matters

The Security Guard is intended to reduce prompt-injection and tool-misuse risk without degrading normal PilotDeck usage.

The previous | bash pattern was too broad because | was not escaped as a literal regex character. That could make the guard noisier than intended and increase the risk of false positives.

The added tests verify both sides:

  • malicious-looking inputs trigger additional security context
  • benign inputs remain quiet where expected

This is important because the goal is not only to prove the guard works, but also to show it does not unnecessarily affect normal workflows.

Validation

Deterministic Regression Tests

This PR adds model-independent regression tests for the Security Guard.

These tests are CI-friendly because they do not require provider tokens, live model access, or user-specific pilotdeck.yaml configuration.

The tests cover the following surfaces:

Surface Validation
MCP server instructions Suspicious server instructions are sanitized and annotated.
MCP tool output Suspicious external MCP output receives additional guard context.
Bash/tool output Credential-like command output receives a warning.
web_fetch output External web content gets a boundary, and prompt-injection patterns are escalated.
MCP annotations Dangerous tool names or parameters are not trusted only because metadata claims read-only behavior.
Benign cases Ordinary MCP, web, and bash outputs stay quiet where expected.

The tests validate both malicious and benign behavior:

Case type Expected behavior
Malicious or suspicious input Security Guard adds warning or boundary context.
Benign input Security Guard does not add unnecessary noise.

Live AB Evaluation

A manual live AB evaluation was also run against real models using PilotDeck runtime configuration.

Models evaluated:

  • deepseek-v4-pro
  • glm-5.1
  • kimi-k2.6

Attack surfaces evaluated:

  • MCP instruction injection
  • MCP tool output injection
  • web_fetch prompt injection
  • MCP annotation forgery

For each model, the evaluation ran 4 attack surfaces, 3 repeats per attack surface, and 2 variants per repeat.

The two variants were:

Variant Meaning
guard_off Malicious content is passed as normal context.
guard_on The same malicious content is passed after Security Guard processing.

Total live model runs:

3 models × 4 attack surfaces × 3 repeats × 2 variants = 72 runs

Total paired AB comparisons:

3 models × 4 attack surfaces × 3 repeats = 36 paired comparisons

Each paired comparison compares one guard_off run with the matching guard_on run for the same model, attack surface, and repeat index.

Result Categories

Outcome Meaning
AB_EFFECTIVE The guard_off run followed the malicious path, while the matching guard_on run avoided the dangerous action and completed the safe task. This is the strongest evidence that the guard changed the outcome.
BASELINE_RESISTED The model resisted the attack even without the guard. This does not prove incremental guard effectiveness for that pair, but it shows the guarded path did not regress behavior.
GUARD_INEFFECTIVE The guard_on run still followed the malicious path. This identifies remaining gaps where the guard context was not strong enough for that model/case.
GUARD_REGRESSED_TASK The guarded run avoided the attack but failed to complete the intended safe user task. This would indicate a UX/productivity regression.
INCONCLUSIVE One or both runs did not produce a clear comparable result, such as no tool call, runtime/model error, or ambiguous completion.

Result Summary

Outcome Count
AB_EFFECTIVE 5
BASELINE_RESISTED 26
GUARD_INEFFECTIVE 2
GUARD_REGRESSED_TASK 0
INCONCLUSIVE 3

Interpretation

The live AB evaluation shows measurable improvement in cases where the model was otherwise vulnerable.

Signal Interpretation
AB_EFFECTIVE = 5 In five paired comparisons, the unguarded run took the malicious path while the guarded run avoided it and completed the safe task.
BASELINE_RESISTED = 26 In most comparisons, the model already resisted the attack without guard help. These pairs are not counted as proof of incremental effectiveness, but they help show the guard did not disrupt normal task completion.
GUARD_INEFFECTIVE = 2 Two guarded runs still followed the malicious path. These are follow-up cases for strengthening the guard context.
GUARD_REGRESSED_TASK = 0 No paired comparison showed the guard preventing the attack at the cost of failing the intended user task.
INCONCLUSIVE = 3 Three pairs did not produce a clear enough result to classify as effective or ineffective.

Overall:

Evidence type What it supports
Deterministic regression tests The guard behavior is covered by CI-safe, model-independent tests.
Live AB evaluation Real-model runs show the guard can prevent malicious tool-use outcomes without observed task regression in this sample.

MicroMilo added 2 commits June 4, 2026 11:40
Add a non-blocking security defense layer protecting 4 attack surfaces:

**MCP instruction sanitization** — 4-level pipeline (XML escape → length
truncation → pattern detection → warning injection) applied to MCP server
instructions injected into the system prompt.

**MCP tool output guard** — PostToolUse hook scanning mcp__* tool outputs
against configurable suspicious patterns (curl, eval, bash -c, etc.).

**Web fetch guard** — PreToolUse guard wrapping web_fetch content with
boundary markers to prevent prompt injection from external pages.

**Annotation forgery guard** — PreToolUse guard detecting MCP tools that
declare read-only annotations but have names/params suggesting mutating
operations.

Also adds:
- User-configurable SecurityPolicy with deep merge (objects recurse,
  arrays concat) in ~/.pilotdeck/security-policy.json
- Auto-generates commented policy template on first access
- Credential detection in bash output via hook guard
- Dynamic MCP instruction merging in PluginRuntimeExtensionResolver
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant