fix: FileTypeRouter matches literal MIME types exactly so "+" and "." stop being treated as regex #11648

Open
Aarkin7 wants to merge 3 commits into deepset-ai:main from Aarkin7:fix/file-type-router-literal-mime-matching
Conversation

@Aarkin7 Aarkin7 commented Jun 15, 2026

Contributor

Related Issues

Proposed Changes:

FileTypeRouter.__init__ compiled every entry in mime_types as a regex without re.escape, even though the docstring already promised exact matching for non-regex strings. Two real consequences:

  1. Any +-containing IANA type (image/svg+xml, application/ld+json, every application/*+xml and application/*+json) silently fell into the unclassified bucket, with no error and no warning.
  2. The . in literals like application/pdf acted as a wildcard, so unrelated strings such as applicationXpdf matched the wrong bucket.
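Both failure modes can be reproduced with the standard re module alone. The snippet below is a minimal sketch, not the router's code: it assumes patterns are checked with fullmatch, and it uses application/vnd.ms-excel for the dot case because that type actually contains a ".".

```python
import re

# Unescaped "+": "g+" means "one or more g", so the literal "+" in the
# input can never match and the type does not match its own pattern.
plus_pattern = re.compile("image/svg+xml")
plus_self = plus_pattern.fullmatch("image/svg+xml")
plus_other = plus_pattern.fullmatch("image/svggxml")

# Unescaped ".": the dot matches any single character.
dot_pattern = re.compile("application/vnd.ms-excel")
dot_other = dot_pattern.fullmatch("application/vndXms-excel")

# Escaped: the pattern matches only the exact string.
escaped = re.compile(re.escape("image/svg+xml"))
escaped_self = escaped.fullmatch("image/svg+xml")
escaped_other = escaped.fullmatch("image/svggxml")

print(plus_self, plus_other is not None, dot_other is not None)
print(escaped_self is not None, escaped_other)
```

With the unescaped pattern the type fails to match itself while an unrelated string matches; after re.escape only the exact string matches.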

Fix: classify each entry in mime_types at init. Strings made up only of IANA-valid characters ([a-zA-Z0-9.+_-] on both sides of /) are compiled with re.escape and matched literally; anything else (e.g. audio/.*) is still compiled as a regex, preserving the documented regex support.
Bucket key remains the user's original string, so output socket names, pipe.connect("router.image/svg+xml", ...), and to_dict / from_dict round-trips are unaffected.
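The classification step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code in the PR: the exact form of _LITERAL_MIME_RE is inferred from the character class quoted above, and compile_mime_entry and build_patterns are hypothetical helper names.

```python
import re

# Assumed shape of the heuristic: only IANA-style characters on both
# sides of a single "/". The real _LITERAL_MIME_RE may differ.
_LITERAL_MIME_RE = re.compile(r"[a-zA-Z0-9.+_-]+/[a-zA-Z0-9.+_-]+")


def compile_mime_entry(entry: str) -> re.Pattern:
    """Hypothetical helper: escape literal MIME types, compile the rest as regex."""
    if _LITERAL_MIME_RE.fullmatch(entry):
        return re.compile(re.escape(entry))
    return re.compile(entry)


def build_patterns(mime_types):
    """Pair each original string (the bucket key) with its compiled pattern."""
    return [(entry, compile_mime_entry(entry)) for entry in mime_types]


patterns = build_patterns(["image/svg+xml", "audio/.*"])
for raw, pattern in patterns:
    print(raw, "->", pattern.pattern)
```

Keeping the original string as the first element of each pair is what keeps the escaped form out of socket names and serialization. One consequence of this sketch worth noting: an entry such as text/.+ consists only of allowed characters, so it would be escaped and treated as a literal rather than as a regex.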

How did you test it?

Added 8 new unit tests in test/components/routers/test_file_router.py:

  • test_literal_mime_with_plus_matches_self (parametrized across 5 +-containing IANA types)
  • test_literal_mime_with_plus_does_not_cross_contaminate
  • test_literal_mime_with_dot_does_not_cross_contaminate
  • test_long_realistic_literal_mime_matches (OOXML)
  • test_explicit_regex_pattern_still_works (regression guard for regex callers)
  • test_to_dict_from_dict_preserves_literal_and_regex_mix
  • test_pipeline_output_socket_name_matches_literal_mime_with_plus (end-to-end Pipeline)
  • test_additional_mimetypes_with_literal_plus (intersection with additional_mimetypes)

Wider verification:

  • test/components/routers/ — 136/136 pass
  • test/core/pipeline/ — 359/359 pass
  • test_multi_file_converter.py (the main downstream consumer of FileTypeRouter) — 7/7 pass
  • hatch run fmt-check clean
  • hatch run test:types on the touched module clean
  • Manual repro of the original three failure modes from the bug report now all pass

Notes for the reviewer

  • The detection heuristic is at the top of haystack/components/routers/file_type_router.py as _LITERAL_MIME_RE. It deliberately excludes the rarely seen !#$&^ RFC 6838 tokens so that a string intended as a regex is never misclassified as a literal. The trade-off is intentional and noted in the inline comment.
  • Bucket key stays the user's original mime_types entry (not pattern.pattern), so the escaped form never leaks to output sockets or to_dict. The pipeline-socket test guards this explicitly.
  • The error message for invalid input was widened from "Invalid regex pattern" to "Invalid MIME type or regex pattern" to reflect that the parameter now formally accepts both. The existing test was updated accordingly.
  • No public API change and no serialization change. The fix aligns the implementation with the docstring's existing promise.

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

@Aarkin7 Aarkin7 requested a review from a team as a code owner June 15, 2026 18:56
@Aarkin7 Aarkin7 requested review from anakin87 and removed request for a team June 15, 2026 18:56
@github-actions github-actions Bot added topic:tests type:documentation Improvements on the docs labels Jun 15, 2026
@anakin87

Member

While the bug is real, I have a few comments.

The . in literals like application/pdf acted as a wildcard, so unrelated strings such as applicationXpdf matched the wrong bucket.

This example is wrong. "application/vnd.ms-excel" might be a better one.


I'd suggest implementing a simpler solution like this:

  # In run(): try an exact literal match first, use the regex only as a fallback
  matched = False
  if mime_type:
      for raw, pattern in self.mime_type_patterns:
          if mime_type == raw or pattern.fullmatch(mime_type):
              mime_types[raw].append(source)
              matched = True
              break
  if not matched:
      mime_types["unclassified"].append(source)

WDYT? Does this work for you?
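To make the suggestion easy to try in isolation, here is a self-contained sketch of the same loop outside the component. The function name route_by_mime and the (source, mime_type) input shape are assumptions for illustration; the patterns are compiled without re.escape, as in the current implementation.

```python
import re
from collections import defaultdict


def route_by_mime(mime_type_patterns, items):
    """Hypothetical stand-in for run(): exact match first, regex as fallback."""
    buckets = defaultdict(list)
    for source, mime_type in items:
        matched = False
        if mime_type:
            for raw, pattern in mime_type_patterns:
                if mime_type == raw or pattern.fullmatch(mime_type):
                    buckets[raw].append(source)
                    matched = True
                    break
        if not matched:
            buckets["unclassified"].append(source)
    return dict(buckets)


raw_types = ["image/svg+xml", "audio/.*", "application/vnd.ms-excel"]
patterns = [(raw, re.compile(raw)) for raw in raw_types]

result = route_by_mime(
    patterns,
    [
        ("logo.svg", "image/svg+xml"),
        ("song.mp3", "audio/mpeg"),
        ("notes.bin", None),
        ("odd.dat", "application/vndXms-excel"),
    ],
)
print(result)
```

In this sketch the string comparison is enough to route image/svg+xml correctly. The regex fallback still treats "." as a wildcard, though, so application/vndXms-excel lands in the application/vnd.ms-excel bucket; whether that over-match matters in practice is a judgment call for the thread.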


Another general note: adding unit tests is good, but let's try to keep good coverage without adding many near-duplicate tests.

@anakin87 anakin87 left a comment

Member

See comments above

Development

Successfully merging this pull request may close these issues.

FileTypeRouter silently drops MIME types containing "+" (e.g. image/svg+xml) into "unclassified"