Description
`UNICODE_BYPASS_RE` in `crates/zeph-sanitizer/src/exfiltration.rs:79` covers only 5 codepoints (U+200B, U+200C, U+200D, U+2060, U+FEFF). Several additional Unicode character categories can also be inserted between `!` and `[` to bypass markdown image detection. These categories are documented in Unicode security advisories and appear in known LLM adversarial prompts.
Missing codepoint categories:
| Category | Codepoints | Risk |
|---|---|---|
| BIDI override characters | U+202A–U+202E | High: LLM BIDI injection attacks |
| Deprecated format chars | U+206A–U+206F | Medium |
| Soft hyphen | U+00AD | Medium: renders invisibly in many contexts |
| Combining grapheme joiner | U+034F | Low |
| Mongolian vowel separator | U+180E | Low |
| TAGS block | U+E0000–U+E007F | High: Unicode steganography, used in adversarial probes |
Highest practical risk: BIDI overrides (U+202A–U+202E) and the TAGS block (U+E0000–U+E007F). Both appear in documented adversarial LLM prompt injection payloads.
Reproduction Steps
- Construct input `"!\u{202A}[alt](https://evil.com/track)"` (U+202A between `!` and `[`)
- Call `ExfiltrationGuard::scan_output` with `block_markdown_images = true`
- Observe: the sequence is not matched by `UNICODE_BYPASS_RE`, so the payload passes through unchanged
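
The failure mode in these steps can be modeled without the crate. The following is a minimal, std-only sketch, not the actual `ExfiltrationGuard` code: `CURRENTLY_COVERED`, `strip_currently_covered`, and `looks_like_markdown_image` are hypothetical stand-ins for the five-codepoint strip and the markdown image check, assuming the check effectively looks for an adjacent `![`.

```rust
/// The five codepoints this issue says are covered today.
/// (Hypothetical constant; the real code uses a regex.)
const CURRENTLY_COVERED: [char; 5] =
    ['\u{200B}', '\u{200C}', '\u{200D}', '\u{2060}', '\u{FEFF}'];

/// Hypothetical stand-in for the strip step: removes only the covered codepoints.
fn strip_currently_covered(input: &str) -> String {
    input
        .chars()
        .filter(|c| !CURRENTLY_COVERED.contains(c))
        .collect()
}

/// Hypothetical stand-in for image detection: `!` immediately followed by `[`.
fn looks_like_markdown_image(input: &str) -> bool {
    input.contains("![")
}
```

Under this model, a payload using U+200B is normalized to `![alt](...)` and detected, while the U+202A payload keeps the invisible character between `!` and `[` and is not detected.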
Expected Behavior
All documented Unicode bypass categories should be detected and stripped, receiving the same treatment as U+200B/U+200C/U+200D/U+2060/U+FEFF.
Actual Behavior
BIDI override characters, soft hyphen, deprecated format chars, and TAGS block codepoints are not covered by `UNICODE_BYPASS_RE`.
Environment
- Commit: 76bfce4
- Crate: `zeph-sanitizer`
- File: `crates/zeph-sanitizer/src/exfiltration.rs:79`
Fix Hint
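Extend `UNICODE_BYPASS_RE` to cover the missing ranges. Consider switching to a Unicode category-based approach (e.g., `\p{Cf}`, which relies on the `regex` crate's `unicode-gencat` feature rather than `unicode-perl`) to catch format/invisible characters by class rather than by explicit enumeration. Two caveats apply: U+034F (combining grapheme joiner) has general category Mn, so `\p{Cf}` does not match it, and the TAGS block (U+E0000–U+E007F) is best included as an explicit range, because only its assigned codepoints are in `\p{Cf}` and unassigned ones such as U+E0000 are not.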
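
As a rough illustration of the explicit-enumeration option, the sketch below lists every range named in this issue in one predicate. It is std-only and deliberately avoids the `regex` crate; `is_bypass_char` and `strip_bypass_chars` are hypothetical names, not existing functions in `zeph-sanitizer`.

```rust
/// Returns true for the codepoints this issue lists as bypass characters.
/// Hypothetical helper; the ranges come from the table in this issue.
fn is_bypass_char(c: char) -> bool {
    matches!(
        c,
        '\u{00AD}'                    // soft hyphen
        | '\u{034F}'                  // combining grapheme joiner
        | '\u{180E}'                  // Mongolian vowel separator
        | '\u{200B}'..='\u{200D}'     // zero-width space, non-joiner, joiner
        | '\u{202A}'..='\u{202E}'     // BIDI embedding/override characters
        | '\u{2060}'                  // word joiner
        | '\u{206A}'..='\u{206F}'     // deprecated format characters
        | '\u{FEFF}'                  // zero-width no-break space (BOM)
        | '\u{E0000}'..='\u{E007F}'   // TAGS block, assigned or not
    )
}

/// Removes every bypass character so that `!` and `[` become adjacent again.
fn strip_bypass_chars(input: &str) -> String {
    input.chars().filter(|c| !is_bypass_char(*c)).collect()
}
```

With this predicate, `strip_bypass_chars("!\u{202A}[alt](https://evil.com/track)")` yields `![alt](https://evil.com/track)`, which the existing image detection can then match. The same ranges could presumably be written as a single regex character class if keeping `UNICODE_BYPASS_RE` as a regex is preferred.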