FEAT: Add word-game option to DecompositionConverter #2051
adrian-gavrila
left a comment
Thanks for the contribution! A few small things worth attention but overall looks great
| word_game_prompt (SeedPrompt | None): Template for the word-game mapping preamble. Defaults | ||
| to the bundled ``decomposition/word_game_preamble.yaml``. Only used when | ||
| ``use_word_game`` is True. | ||
| codewords (tuple[str, ...]): Innocuous codewords substituted for harmful noun phrases when |
Nit: Rationale's already on the _CODEWORDS comment and the Raises: block; arg docstrings here stay terse (cf. _MIN_RECALL). Could trim.
trimmed to one line; the bound is already covered by the overflow ValueError
| self._word_game_prompt = word_game_prompt or SeedPrompt.from_yaml_file( | ||
| _DECOMPOSITION_DIR / "word_game_preamble.yaml" | ||
| ) | ||
| self._codewords = codewords |
codewords isn't validated for uniqueness, so duplicates silently yield an ambiguous mapping ('apple' means 'bomb'; 'apple' means 'gun'). Worth a fail-fast len(set(codewords)) != len(codewords) check in __init__?
Added a fail-fast len(set(codewords)) != len(codewords) check in __init__ + a Raises doc entry + test_duplicate_codewords_raise
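For reference, a minimal sketch of the fail-fast check discussed in this thread. The helper name validate_codewords and the error wording are hypothetical, chosen for illustration; the PR performs the check inside __init__, and its exact message is not shown here.

```python
def validate_codewords(codewords: tuple[str, ...]) -> tuple[str, ...]:
    """Reject duplicate codewords so each one maps to exactly one phrase.

    Hypothetical standalone helper; the PR does this check in __init__.
    """
    # A set drops duplicates, so a length mismatch means at least one repeat.
    if len(set(codewords)) != len(codewords):
        # Error wording is an assumption, not the PR's actual message.
        raise ValueError("codewords must be unique")
    return codewords
```

Failing at construction time surfaces the problem before any prompt is built, rather than producing a mapping in which one codeword stands for two phrases.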
| if self._use_word_game: | ||
| if noun_index > len(self._codewords): | ||
| raise ValueError( | ||
| f"word-game supports at most {len(self._codewords)} noun phrases, got {noun_index}" |
Nit: noun_index is the first overflowing index (len+1), not the total noun count, so 25 nouns with 20 codewords reports got 21. Maybe reword to a threshold breach.
Reworded to state the threshold breach, no misleading count
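A sketch of what a threshold-style overflow check could look like. It assumes noun_index is 1-based, as the review comment implies (the first overflowing index is len + 1). The function name codeword_for, the lower-bound guard, and the message text are hypothetical, not the wording that landed in the PR.

```python
def codeword_for(noun_index: int, codewords: tuple[str, ...]) -> str:
    """Return the codeword for a 1-based noun index (hypothetical helper)."""
    if noun_index < 1:
        # Guard added for this sketch only; not part of the reviewed diff.
        raise ValueError("noun_index is 1-based and must be >= 1")
    if noun_index > len(codewords):
        # State the limit that was exceeded instead of reporting noun_index,
        # which is only the first overflowing index, not the total noun count.
        raise ValueError(
            f"word-game supports at most {len(codewords)} noun phrases; "
            "the prompt contains more than that"
        )
    return codewords[noun_index - 1]
```

With 25 nouns and 20 codewords, this raises at index 21 but the message reports only the limit of 20, so it cannot be misread as a noun count.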
…mposition-word-game
@adrian-gavrila Thanks for the review. Addressed all three: codeword uniqueness is now validated in __init__ with a test, the arg docstring is trimmed, and the overflow message states the threshold breach instead of a count.
Description
This adds an optional word-game mode to DecompositionConverter (the DrAttack decompose-and-reconstruct converter from #2003), via use_word_game: bool = False. When enabled, each harmful noun phrase is replaced by an innocuous codeword in the reconstruction questions, and a mapping preamble (for example 'apple' means 'a bomb') is established in the same prompt. This is the second half of DrAttack: it further conceals the harmful nouns by splitting them from the request behind codewords.
Off by default, so the merged converter behaviour is unchanged.
Two design choices worth flagging up front:
Inline, not a separate prepended conversation. We had discussed the word-game as a prepended/simulated conversation; I went with inline (preamble and reconstruction in one prompt) for two reasons. First, coupling: the codewords must match the reconstruction the converter builds, and a separate conversation generates its turns independently, so they cannot share the mapping without a stateful component (an attack class), which we wanted to avoid. Inline keeps it a pure converter. Second, on the numbers, inline matches the two-turn version, and both are far above no word-game:
So, inline essentially keeps all of the effects on the frontier model, with no new attack class. Open to the prepended-conversation route if you prefer it.
A toggle on the converter, not a separate converter. The codewords have to stay in sync with the reconstruction this converter produces, so a separate converter cannot do it; it has to be a mode of this converter.
Note on the mechanism: the harmful phrase still appears once, in the mapping line; the concealment is that the question uses the codeword, splitting the harmful term from the request. This is the paper's word-game, and the numbers above show the lift.
(All numbers are GPT-judge refusal-bypass, not operational harm, consistent with the #2003 assessment.)
Tests and Documentation
use_word_game parameter in doc/code/converters/1_text_to_text_converters.py; ran JupyText --sync.
ruff check and format clean; ty reports no errors; full converter and docs test suites pass.
cc @rlundeen2 @romanlutz