fix: validate block_sizes and split_overlap in HierarchicalDocumentSplitter #11688

Open: Kunal-Somani wants to merge 2 commits into deepset-ai:main from Kunal-Somani:fix/hierarchical-splitter-validation
Conversation

@Kunal-Somani

Related Issues

  • No related issue (discovered while reviewing validation parity across preprocessor/splitter components)

Proposed Changes:

HierarchicalDocumentSplitter had two validation gaps: one produced silently wrong behavior, the other a confusing error:

  1. Empty block_sizes: Constructing HierarchicalDocumentSplitter(block_sizes=set()) succeeded with no error. Calling run() then silently returned the input document unsplit, with no warning that nothing was split.

  2. split_overlap too large: Constructing HierarchicalDocumentSplitter(block_sizes={5, 3}, split_overlap=4) raised: ValueError: split_overlap must be less than split_length.

This error originates from the internal DocumentSplitter that HierarchicalDocumentSplitter builds per block size. It references split_length, a parameter name that does not exist on HierarchicalDocumentSplitter's public API, and gives no indication of which block_size triggered the failure.
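The leak described above can be reproduced without Haystack. The following is a minimal sketch using two hypothetical stand-in classes, `InnerSplitter` and `OuterSplitter`, which are not the library's actual classes; it assumes only what this description states, namely that one internal splitter is built per block size and that the inner check is phrased in terms of split_length.

```python
from typing import Dict, Set


class InnerSplitter:
    """Hypothetical stand-in for the internal per-block-size splitter."""

    def __init__(self, split_length: int, split_overlap: int) -> None:
        # The inner component only knows its own parameter names.
        if split_overlap >= split_length:
            raise ValueError("split_overlap must be less than split_length.")
        self.split_length = split_length
        self.split_overlap = split_overlap


class OuterSplitter:
    """Hypothetical stand-in for a wrapper that does no validation of its own."""

    def __init__(self, block_sizes: Set[int], split_overlap: int = 0) -> None:
        # One inner splitter per block size; an empty set builds nothing and raises nothing.
        self.splitters: Dict[int, InnerSplitter] = {
            size: InnerSplitter(size, split_overlap)
            for size in sorted(block_sizes, reverse=True)
        }


def leaked_message(block_sizes: Set[int], split_overlap: int) -> str:
    """Return the error text raised at construction, or an empty string if none."""
    try:
        OuterSplitter(block_sizes, split_overlap)
    except ValueError as err:
        return str(err)
    return ""
```

With these stand-ins, `leaked_message({5, 3}, 4)` returns the inner message, which mentions split_length and does not say that the block size 3 was the one that failed, while `leaked_message(set(), 0)` returns an empty string because nothing is ever constructed or checked.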

I fixed this by adding explicit validation in HierarchicalDocumentSplitter.__init__, before delegating to the internal DocumentSplitter instances:

  • Raises ValueError if block_sizes is empty
  • Raises ValueError if split_overlap is negative
  • Raises ValueError if split_overlap >= min(block_sizes), with a message that uses HierarchicalDocumentSplitter's own parameter names and names the smallest block size that violates the constraint
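The three checks listed above could look roughly like the sketch below. It is a standalone illustration, not the code in this PR: the function names `validate_splitter_params` and `validation_error` and the exact message wording are assumptions made for the example.

```python
from typing import Optional, Set


def validate_splitter_params(block_sizes: Set[int], split_overlap: int) -> None:
    """Hypothetical helper mirroring the three checks; message wording is assumed."""
    if not block_sizes:
        raise ValueError("block_sizes must not be empty.")
    if split_overlap < 0:
        raise ValueError("split_overlap must be greater than or equal to 0.")
    smallest = min(block_sizes)
    if split_overlap >= smallest:
        # Report the offending block size using the public parameter names.
        raise ValueError(
            f"split_overlap ({split_overlap}) must be less than the smallest "
            f"value in block_sizes ({smallest})."
        )


def validation_error(block_sizes: Set[int], split_overlap: int) -> Optional[str]:
    """Return the validation message, or None when the parameters are accepted."""
    try:
        validate_splitter_params(block_sizes, split_overlap)
    except ValueError as err:
        return str(err)
    return None
```

Checking for an empty set first matters because `min()` raises its own ValueError on an empty collection, which would again surface an implementation detail instead of a message about block_sizes.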

How did you test it?

  • Reproduced both issues manually against the unpatched code to confirm they were real before writing any fix
  • Ran the splitter's unit tests locally: PYTHONPATH="" hatch run test:unit test/components/preprocessors/test_hierarchical_doc_splitter.py - 17 passed (11 existing + 6 new)
  • Ran type checks: hatch run test:types - no issues in 365 source files
  • Ran formatter: hatch run fmt - auto-fixed one formatting issue, all checks now pass
  • Confirmed 100% statement coverage on the modified file

Notes for the reviewer

This is not a breaking change for any valid existing usage. Only inputs that previously did nothing silently or raised a confusing, implementation-leaking error are affected, and they now raise a clear, actionable ValueError. For this reason the release note has no upgrade: section; the change is filed under fixes: only.

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

@Kunal-Somani Kunal-Somani requested a review from a team as a code owner June 19, 2026 11:02
@Kunal-Somani Kunal-Somani requested review from sjrl and removed request for a team June 19, 2026 11:02
@github-actions github-actions Bot added topic:tests type:documentation Improvements on the docs labels Jun 19, 2026
@sjrl sjrl self-assigned this Jun 19, 2026
@github-actions

Coverage report

File: haystack/components/preprocessors/hierarchical_document_splitter.py

This report was generated by python-coverage-comment-action
