security(sanitizer): UNICODE_BYPASS_RE missing BIDI, soft hyphen, and TAGS block codepoints #4760

@bug-ops

Description

UNICODE_BYPASS_RE in crates/zeph-sanitizer/src/exfiltration.rs:79 covers only 5 codepoints (U+200B, U+200C, U+200D, U+2060, U+FEFF). Several additional Unicode character categories can also be inserted between `!` and `[` to bypass markdown image detection. These categories are documented in Unicode security advisories and appear in known LLM adversarial prompts.

Missing codepoint categories:

| Category | Codepoints | Risk |
|---|---|---|
| BIDI override characters | U+202A–U+202E | High: LLM BIDI injection attacks |
| Deprecated format chars | U+206A–U+206F | Medium |
| Soft hyphen | U+00AD | Medium: renders invisibly in many contexts |
| Combining grapheme joiner | U+034F | Low |
| Mongolian vowel separator | U+180E | Low |
| TAGS block | U+E0000–U+E007F | High: Unicode steganography, used in adversarial probes |

Highest practical risk: BIDI overrides (U+202A–U+202E) and the TAGS block (U+E0000–U+E007F). Both appear in documented adversarial LLM prompt injection payloads.

Reproduction Steps

  1. Construct input "!\u{202A}[alt](https://evil.com/track)" (U+202A between ! and [)
  2. Call ExfiltrationGuard::scan_output with block_markdown_images = true
  3. Observe: the sequence is not matched by UNICODE_BYPASS_RE, so the payload passes through unchanged
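
The reproduction can be approximated without the crate. The following is a minimal std-only sketch, assuming the five codepoints listed in the description are the complete set that UNICODE_BYPASS_RE strips. `in_current_set`, `strip_current_set`, and `contains_image_marker` are hypothetical stand-ins for illustration, not the zeph-sanitizer API.

```rust
/// Hypothetical stand-in for the five codepoints the issue says
/// UNICODE_BYPASS_RE currently covers.
fn in_current_set(c: char) -> bool {
    matches!(c, '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{2060}' | '\u{FEFF}')
}

/// Strip only the currently covered codepoints (assumed behavior,
/// not the real implementation).
fn strip_current_set(input: &str) -> String {
    input.chars().filter(|c| !in_current_set(*c)).collect()
}

/// Naive markdown-image check: a literal "![" anywhere in the text.
fn contains_image_marker(text: &str) -> bool {
    text.contains("![")
}
```

Under these assumptions, `"!\u{200B}[alt](https://evil.com/track)"` normalizes to a string containing `![` and is detected, while `"!\u{202A}[alt](https://evil.com/track)"` keeps U+202A between `!` and `[` and is missed.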

Expected Behavior

All documented Unicode bypass categories should be detected and stripped, matching the same treatment as U+200B/U+200C/U+200D/U+2060/U+FEFF.

Actual Behavior

BIDI override characters, soft hyphen, deprecated format chars, and TAGS block codepoints are not covered by UNICODE_BYPASS_RE.

Environment

  • Commit: 76bfce4
  • Crate: zeph-sanitizer
  • File: crates/zeph-sanitizer/src/exfiltration.rs:79

Fix Hint

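Extend UNICODE_BYPASS_RE to cover the missing ranges. Consider switching to a Unicode category-based approach (e.g., \p{Cf}, which needs the regex crate's unicode-gencat feature; unicode-perl only covers \d, \s, \w, and \b) to catch format/invisible characters by class rather than by explicit enumeration. \p{Cf} alone is not a complete substitute: U+034F is a nonspacing mark (Mn), and within the TAGS block only the assigned tag characters (U+E0001, U+E0020–U+E007F) are Cf, so U+034F and the full U+E0000–U+E007F range still require explicit inclusion.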
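
One way to picture the explicit-enumeration variant of this fix is the std-only sketch below. It encodes the five existing codepoints plus the ranges from the table above as a character predicate. `is_bypass_char` and `strip_bypass_chars` are hypothetical names, and the real fix would presumably live in the UNICODE_BYPASS_RE pattern itself rather than in a helper like this.

```rust
/// Illustrative extended set: the five codepoints already covered,
/// plus the categories listed as missing in this issue.
fn is_bypass_char(c: char) -> bool {
    matches!(
        c,
        '\u{00AD}'                      // soft hyphen
            | '\u{034F}'                // combining grapheme joiner
            | '\u{180E}'                // Mongolian vowel separator
            | '\u{200B}'..='\u{200D}'   // zero-width space, non-joiner, joiner
            | '\u{202A}'..='\u{202E}'   // BIDI embeddings and overrides
            | '\u{2060}'                // word joiner
            | '\u{206A}'..='\u{206F}'   // deprecated format characters
            | '\u{FEFF}'                // zero-width no-break space / BOM
            | '\u{E0000}'..='\u{E007F}' // TAGS block
    )
}

/// Remove every character in the extended set before image detection runs.
fn strip_bypass_chars(input: &str) -> String {
    input.chars().filter(|c| !is_bypass_char(*c)).collect()
}
```

With this predicate, the payload from the reproduction steps normalizes to `![alt](https://evil.com/track)`, which a subsequent markdown image check can match. Explicit ranges keep the set auditable, at the cost of needing updates as new invisible codepoints are identified.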

Labels: P3 (Research, medium-high complexity), security (Security-related issue)
