
FIX: surface unscoreable sub-scores that mask composite true/false verdicts #2042

Open
AUTHENSOR wants to merge 1 commit into microsoft:main from
AUTHENSOR:fix/composite-scorer-unscoreable-masking

Conversation

@AUTHENSOR

Description

A TrueFalseScorer that cannot evaluate a response returns a fallback
Score(score_value="false") via TrueFalseScorer._build_fallback_score
(pyrit/score/true_false/true_false_scorer.py, the "No supported pieces to score
after filtering" branch). This happens whenever no message piece survives the
scorer's validator filtering, e.g. an image-only scorer handed a text-only
response. The base Scorer.score_async invokes this fallback at
pyrit/score/scorer.py:267-268.
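
As a rough illustration of the filtering path described above, the sketch below uses hypothetical stand-ins (`Piece`, `score_true_false`, a `judge` callable) rather than PyRIT's actual classes. It only shows how an empty post-filter list ends in a plain false; combining surviving pieces with `any` is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Piece:
    """Hypothetical stand-in for a message piece."""
    data_type: str
    content: str


def score_true_false(
    pieces: list[Piece],
    supported_types: set[str],
    judge: Callable[[Piece], bool],
) -> tuple[bool, str]:
    """Return (verdict, rationale) for a response made of several pieces."""
    # Keep only the pieces this scorer is able to evaluate.
    supported = [p for p in pieces if p.data_type in supported_types]
    if not supported:
        # Fallback branch: nothing was judged, yet the verdict is a plain False.
        return False, "No supported pieces to score after filtering"
    # Assumption for this sketch: any harmful piece makes the verdict True.
    return any(judge(p) for p in supported), "scored"


# An image-only scorer handed a text-only response never calls its judge.
text_only_response = [Piece("text", "harmful payload")]
fallback = score_true_false(text_only_response, {"image_path"}, lambda p: True)
```

Here `fallback` carries the same `False` a scorer would return after actually judging the response harmless, which is the ambiguity this PR addresses.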

The problem: this "could-not-score" false is indistinguishable from a genuine
"not harmful" false. In a TrueFalseCompositeScorer
(pyrit/score/true_false/true_false_composite_scorer.py:104-141) using
TrueFalseScoreAggregator.AND (pyrit/score/true_false/true_false_score_aggregator.py,
functools.reduce(operator.and_, bool_values)), a sub-scorer that merely could not
evaluate the response contributes a hard false that silently vetoes another
sub-scorer's confirmed harmful true. The aggregate reports false = "attack did
not succeed". For a red-team this is a false-assurance hazard: a confirmed success
is under-reported, with no signal that a sub-scorer abstained rather than judging
the response not harmful.
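
The masking can be reproduced with nothing more than the reduce expression quoted above. In this minimal sketch the two verdict variables are hypothetical; only `functools.reduce(operator.and_, bool_values)` comes from the description.

```python
import functools
import operator

# Hypothetical sub-scorer verdicts for one text-only response.
text_scorer_verdict = True    # a text scorer confirmed the response is harmful
image_scorer_verdict = False  # an image-only scorer fell back: nothing to score

bool_values = [text_scorer_verdict, image_scorer_verdict]
masked_result = functools.reduce(operator.and_, bool_values)

# A case where the second scorer really judged the response not harmful.
genuine_result = functools.reduce(operator.and_, [True, False])

# Both aggregates are False; the booleans alone cannot tell the cases apart.
indistinguishable = masked_result == genuine_result
```

Because the abstention and the genuine judgment reduce to the same value, any distinction has to travel outside the boolean, for example in metadata.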

This change makes the masking visible and the could-not-score state
distinguishable, without changing any aggregate verdict value (non-breaking):

  • TrueFalseScorer._build_fallback_score now sets
    score_metadata={"unscoreable": 1} on the filtered fallback false only.
    Blocked and error fallbacks are intentionally left unflagged — those are real
    observations of the target's behavior, not an inability to score. The flag key
    (UNSCOREABLE_METADATA_KEY) lives in true_false_score_aggregator.py (the
    lower-level module) to avoid a circular import and give the base scorer and the
    aggregators a single source of truth.
  • The true/false aggregators (AND/OR/MAJORITY, the single chokepoint for both
    the composite scorer and multi-piece TrueFalseScorer aggregation) now detect
    unscoreable sub-scores. When any are present they emit a logger.warning and
    append a note to the aggregated rationale. When an abstention dragged an
    otherwise-true signal down to a false aggregate (the classic AND-masking case),
    the warning and note call that out explicitly as a possible under-reported success.
    The unscoreable flag also propagates into the aggregate's score_metadata via the
    existing combine_metadata_and_categories path, so the distinction survives upward.
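
The two bullets above can be sketched roughly as follows. This is not the PR's code: `SubScore` and `and_with_abstention_note` are hypothetical names, the string value of `UNSCOREABLE_METADATA_KEY` is assumed from the `{"unscoreable": 1}` metadata, and the note wording is invented for illustration. The masking test used here (a false aggregate with at least one judged true) is also an assumption about how the PR defines that case.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("unscoreable_sketch")

# Assumed value; the PR only shows the metadata as {"unscoreable": 1}.
UNSCOREABLE_METADATA_KEY = "unscoreable"


@dataclass
class SubScore:
    """Hypothetical stand-in for a true/false score."""
    value: bool
    rationale: str
    metadata: dict = field(default_factory=dict)


def and_with_abstention_note(scores: list[SubScore]) -> SubScore:
    """AND-aggregate sub-scores, flagging abstentions without changing the verdict."""
    verdict = all(s.value for s in scores)
    abstained = [s for s in scores if s.metadata.get(UNSCOREABLE_METADATA_KEY)]
    judged_true = any(
        s.value for s in scores if not s.metadata.get(UNSCOREABLE_METADATA_KEY)
    )
    rationale = " | ".join(s.rationale for s in scores)
    metadata: dict = {}
    if abstained:
        # Propagate the flag so callers of the aggregate can still see it.
        metadata[UNSCOREABLE_METADATA_KEY] = 1
        note = f"{len(abstained)} sub-score(s) could not be evaluated"
        if not verdict and judged_true:
            note += "; a true sub-score may be masked (possible under-reported success)"
        logger.warning(note)
        rationale = f"{rationale} | NOTE: {note}"
    return SubScore(verdict, rationale, metadata)


masked = and_with_abstention_note([
    SubScore(True, "text scorer: harmful"),
    SubScore(False, "No supported pieces to score after filtering",
             {UNSCOREABLE_METADATA_KEY: 1}),
])
```

In this sketch `masked.value` stays `False`, matching the non-breaking intent, while the rationale and metadata record that one sub-scorer abstained.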

The aggregate verdict value is never changed. Whether the default verdict for an
unscoreable input should differ (e.g. abstain / skip rather than contribute false)
is a separate, behavior-changing discussion tracked in a companion issue.

Tests and Documentation

New deterministic regression tests (no LLM, no network):

  • tests/unit/score/test_true_false_score_aggregator.py
    • test_and_unscoreable_masks_true_warns_and_notes — AND over a confirmed true
      plus an unscoreable false: asserts the verdict value is still False
      (non-breaking guard), a warning naming the masking hazard is logged, and the
      rationale records the abstention and the possible under-reporting.
    • test_and_unscoreable_present_without_true_warns_but_no_masking_note — an
      unscoreable false with no competing true is noted but not flagged as masking.
    • test_genuine_all_false_is_not_flagged — a genuine all-false aggregate emits no
      warning and no note.
  • tests/unit/score/test_true_false_composite_scorer.py
    • test_unscoreable_fallback_is_marked_distinguishable — a real image-only scorer
      on a text response produces the fallback false carrying {"unscoreable": 1}.
    • test_composite_and_unscoreable_masking_is_visible_but_verdict_unchanged — a real
      text harmful-true scorer composed under AND with the filtered image-only scorer:
      asserts (a) the unscoreable flag propagates into the aggregate metadata, (b) a
      warning is logged and the rationale notes the abstention, and (c) the AND verdict
      value is unchanged (False).
    • test_composite_and_genuine_all_false_is_not_flagged — genuine all-false
      composite is not flagged.

How checks were run (from a fresh clone):

  • ruff check and ruff format --check on the two changed source files and the two
    changed test files — all clean.
  • python -m pytest tests/unit/score/test_true_false_composite_scorer.py tests/unit/score/test_true_false_score_aggregator.py -q — 31 passed.
  • Full python -m pytest tests/unit/score/ -q — 1197 passed, 16 skipped (no
    regressions).
  • ty check on the two changed source files — all checks passed.

(Test payload text is summarized as "harmful payload"; no actual harmful content is
included.)

JupyText / docs notebooks: N/A — this change is internal scorer observability with no
public API surface change and no doc/notebook updates.

…icts

A TrueFalseScorer that cannot evaluate a response (no piece survives
validator filtering) returns a fallback Score(false) that is
indistinguishable from a genuine 'not harmful' false. Under a
TrueFalseCompositeScorer with the AND aggregator, such a 'could not score'
false silently vetoes another sub-scorer's confirmed harmful true, so the
aggregate reports 'attack did not succeed' with no signal that a sub-scorer
abstained. For a red-team this under-reports a real success.

Non-breaking observability fix (verdict values unchanged):
- Mark the filtered fallback Score with score_metadata {unscoreable: 1} in
  TrueFalseScorer._build_fallback_score so a 'could not score' false is
  distinguishable from a real 'not harmful' false.
- In the true/false aggregators, when one or more unscoreable sub-scores are
  present, emit a logger.warning and append a note to the aggregated rationale
  (calling out the masking case where an abstention dragged an otherwise-true
  signal to false under AND). The aggregate verdict value itself is unchanged.

Adds deterministic regression tests (no LLM) covering the metadata flag, the
warning/rationale note, the unchanged verdict value, and that a genuine
all-false case is not flagged.