FIX: surface unscoreable sub-scores that mask composite true/false verdicts #2042
Open
AUTHENSOR wants to merge 1 commit into
…icts
A TrueFalseScorer that cannot evaluate a response (no piece survives
validator filtering) returns a fallback Score(false) that is
indistinguishable from a genuine 'not harmful' false. Under a
TrueFalseCompositeScorer with the AND aggregator, such a 'could not score'
false silently vetoes another sub-scorer's confirmed harmful true, so the
aggregate reports 'attack did not succeed' with no signal that a sub-scorer
abstained. For a red-team this under-reports a real success.
Non-breaking observability fix (verdict values unchanged):
- Mark the filtered fallback Score with score_metadata {unscoreable: 1} in
TrueFalseScorer._build_fallback_score so a 'could not score' false is
distinguishable from a real 'not harmful' false.
- In the true/false aggregators, when one or more unscoreable sub-scores are
present, emit a logger.warning and append a note to the aggregated rationale
(calling out the masking case where an abstention dragged an otherwise-true
signal to false under AND). The aggregate verdict value itself is unchanged.
Adds deterministic regression tests (no LLM) covering the metadata flag, the
warning/rationale note, the unchanged verdict value, and that a genuine
all-false case is not flagged.
Description
A `TrueFalseScorer` that cannot evaluate a response returns a fallback
`Score(score_value="false")` via `TrueFalseScorer._build_fallback_score`
(`pyrit/score/true_false/true_false_scorer.py`, the "No supported pieces to score
after filtering" branch). This happens whenever no message piece survives the
scorer's validator filtering, e.g. an image-only scorer handed a text-only
response. The base `Scorer.score_async` invokes this fallback at
`pyrit/score/scorer.py:267-268`.

The problem: this "could-not-score" `false` is indistinguishable from a genuine
"not harmful" `false`. In a `TrueFalseCompositeScorer`
(`pyrit/score/true_false/true_false_composite_scorer.py:104-141`) using
`TrueFalseScoreAggregator.AND`
(`pyrit/score/true_false/true_false_score_aggregator.py`,
`functools.reduce(operator.and_, bool_values)`), a sub-scorer that merely could not
evaluate the response contributes a hard `false` that silently vetoes another
sub-scorer's confirmed harmful `true`. The aggregate reports `false` = "attack did
not succeed". For a red-team this is a false-assurance hazard: a confirmed success
is under-reported, with no signal that a sub-scorer abstained rather than judging
the response not harmful.
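
The masking can be reproduced with plain booleans. The following is a minimal sketch, not PyRIT code: the variable names are hypothetical, and only the `functools.reduce(operator.and_, bool_values)` expression is taken from the description above.

```python
import functools
import operator

# Hypothetical sub-score verdicts (plain booleans, not PyRIT Score objects).
text_scorer_verdict = True    # a text scorer confirmed a harmful response
image_scorer_verdict = False  # an image-only scorer had nothing to score (fallback false)

# Case 1: one sub-scorer abstained, but its fallback is an ordinary False.
bool_values = [text_scorer_verdict, image_scorer_verdict]
masked_result = functools.reduce(operator.and_, bool_values)

# Case 2: the second sub-scorer genuinely judged the response "not harmful".
genuine_values = [True, False]
genuine_result = functools.reduce(operator.and_, genuine_values)

# Both cases collapse to the same aggregate, so the abstention is invisible.
print(masked_result, genuine_result)
```

Because the two inputs are identical lists of booleans, nothing in the aggregate can tell the abstention apart from a real negative verdict.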
This change makes the masking visible and the could-not-score state
distinguishable, without changing any aggregate verdict value (non-breaking):

- `TrueFalseScorer._build_fallback_score` now sets
  `score_metadata={"unscoreable": 1}` on the filtered fallback `false` only.
  Blocked and error fallbacks are intentionally left unflagged: those are real
  observations of the target's behavior, not an inability to score. The flag key
  (`UNSCOREABLE_METADATA_KEY`) lives in `true_false_score_aggregator.py` (the
  lower-level module) to avoid a circular import and give the base scorer and the
  aggregators a single source of truth.
- The true/false aggregators (`AND`/`OR`/`MAJORITY`, the single chokepoint for both
  the composite scorer and multi-piece `TrueFalseScorer` aggregation) now detect
  unscoreable sub-scores. When any are present they emit a `logger.warning` and
  append a note to the aggregated rationale. When an abstention dragged an
  otherwise-`true` signal down to a `false` aggregate (the classic AND-masking case),
  the warning and note call that out explicitly as a possible under-reported success.
  The unscoreable flag also propagates into the aggregate's `score_metadata` via the
  existing `combine_metadata_and_categories` path, so the distinction survives upward.

The aggregate verdict value is never changed. Whether the default verdict for an
unscoreable input should differ (e.g. abstain / skip rather than contribute `false`)
is a separate, behavior-changing discussion tracked in a companion issue.
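
A rough sketch of the two behaviors described above, under stated assumptions: `SubScore`, `build_fallback_score`, and `aggregate_and` are hypothetical stand-ins written for illustration, not PyRIT's `Score` class or its aggregator implementation. Only the key name `UNSCOREABLE_METADATA_KEY` and its `{"unscoreable": 1}` value come from the description. The masking test (the scoreable sub-scores alone would aggregate to `true`) is one possible reading of "otherwise-`true`".

```python
import functools
import logging
import operator
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)

UNSCOREABLE_METADATA_KEY = "unscoreable"  # key name taken from the PR description


@dataclass
class SubScore:
    """Hypothetical stand-in for a true/false score; not PyRIT's Score class."""

    value: bool
    rationale: str = ""
    metadata: dict = field(default_factory=dict)


def build_fallback_score() -> SubScore:
    # The filtered fallback stays False but carries the flag.
    return SubScore(
        value=False,
        rationale="No supported pieces to score after filtering",
        metadata={UNSCOREABLE_METADATA_KEY: 1},
    )


def aggregate_and(scores: list) -> SubScore:
    # The verdict is computed exactly as before; the flag never alters it.
    verdict = functools.reduce(operator.and_, [s.value for s in scores])
    rationale = " | ".join(s.rationale for s in scores if s.rationale)
    metadata = {}

    unscoreable = [s for s in scores if s.metadata.get(UNSCOREABLE_METADATA_KEY)]
    if unscoreable:
        metadata[UNSCOREABLE_METADATA_KEY] = 1  # propagate the flag upward
        note = f"{len(unscoreable)} sub-score(s) could not be evaluated."
        scoreable = [s.value for s in scores if s not in unscoreable]
        # Assumed masking test: everything that could be scored was true.
        if not verdict and scoreable and all(scoreable):
            note += " Possible masking: an abstention turned a true signal into false."
        logger.warning(note)
        rationale = f"{rationale} [{note}]"

    return SubScore(value=verdict, rationale=rationale, metadata=metadata)


harmful = SubScore(value=True, rationale="harmful payload detected")
result = aggregate_and([harmful, build_fallback_score()])
print(result.value, result.metadata)
```

In this sketch the flag travels in metadata and the note travels in the rationale, so a caller that only reads the boolean sees no behavior change.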
Tests and Documentation
New deterministic regression tests (no LLM, no network):

`tests/unit/score/test_true_false_score_aggregator.py`

- `test_and_unscoreable_masks_true_warns_and_notes`: AND over a confirmed `true`
  plus an unscoreable `false`. Asserts the verdict value is still `False`
  (non-breaking guard), a warning naming the masking hazard is logged, and the
  rationale records the abstention and the possible under-reporting.
- `test_and_unscoreable_present_without_true_warns_but_no_masking_note`: an
  unscoreable `false` with no competing `true` is noted but not flagged as masking.
- `test_genuine_all_false_is_not_flagged`: a genuine all-`false` aggregate emits no
  warning and no note.

`tests/unit/score/test_true_false_composite_scorer.py`

- `test_unscoreable_fallback_is_marked_distinguishable`: a real image-only scorer
  on a text response produces the fallback `false` carrying `{"unscoreable": 1}`.
- `test_composite_and_unscoreable_masking_is_visible_but_verdict_unchanged`: a real
  text harmful-`true` scorer composed under AND with the filtered image-only scorer.
  Asserts (a) the unscoreable flag propagates into the aggregate metadata, (b) a
  warning is logged and the rationale notes the abstention, and (c) the AND verdict
  value is unchanged (`False`).
- `test_composite_and_genuine_all_false_is_not_flagged`: genuine all-`false`
  composite is not flagged.
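
The shape of such a deterministic check can be sketched without PyRIT or pytest fixtures. Everything here is an assumption made for illustration: `and_with_warning` is a hypothetical stand-in for the aggregator under test, the logger name is invented, and the real tests may capture logs differently.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("sketch.aggregator")  # hypothetical logger name
logger.setLevel(logging.WARNING)


class _ListHandler(logging.Handler):
    """Collects formatted warning messages in a list for assertions."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


@contextmanager
def captured_warnings():
    handler = _ListHandler()
    logger.addHandler(handler)
    try:
        yield handler.messages
    finally:
        logger.removeHandler(handler)


def and_with_warning(values, unscoreable_flags):
    """Hypothetical stand-in: AND verdict plus a warning when any input abstained."""
    verdict = all(values)
    if any(unscoreable_flags):
        logger.warning("unscoreable sub-score present; verdict may be masked")
    return verdict


def test_and_masking_warns_but_verdict_unchanged():
    with captured_warnings() as messages:
        verdict = and_with_warning([True, False], [False, True])
    assert verdict is False  # non-breaking guard: value is still False
    assert len(messages) == 1 and "unscoreable" in messages[0]


def test_genuine_all_false_is_not_flagged():
    with captured_warnings() as messages:
        verdict = and_with_warning([False, False], [False, False])
    assert verdict is False
    assert messages == []  # no abstention, so no warning
```

The functions follow pytest naming so a test runner would collect them, and they can also be called directly.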
How checks were run (from a fresh clone):

- `ruff check` and `ruff format --check` on the two changed source files and the two
  changed test files: all clean.
- `python -m pytest tests/unit/score/test_true_false_composite_scorer.py tests/unit/score/test_true_false_score_aggregator.py -q`:
  31 passed.
- `python -m pytest tests/unit/score/ -q`: 1197 passed, 16 skipped (no
  regressions).
- `ty check` on the two changed source files: all checks passed.

(Test payload text is summarized as "harmful payload"; no actual harmful content is
included.)

JupyText / docs notebooks: N/A. This change is internal scorer observability with no
public API surface change and no doc/notebook updates.