Render real table headers (not column letters) + read/flag hidden cells, columns & sheets#13
Merged
Merged
Conversation
render_text is now a standard Markdown table whose header row holds the real column names instead of Excel column letters (A|B|C). Column letters move to a `cols:` map on the bracket line and a `row` gutter is added, so an agent can still rebuild a full A1 ref (column name + row number -> B3) without the letters masquerading as headers and defeating downstream header detection. Hidden content is now read and flagged instead of dropped: - hidden rows/cols render inline ([hidden]) in text and carry data-hidden in HTML (previously omitted entirely) - hidden worksheets keep being chunked, marked [hidden sheet] / data-sheet-hidden - all hidden state is stored as structured chunk metadata (sheet_hidden / hidden_rows / hidden_cols) and exported via to_json() Motivation: r/RAG feedback that the column-letter header row defeats real-header detection in spreadsheet RAG pipelines. Tested via TDD (red->green) with 6 new tests; full suite 1077 passing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Surfaced from r/RAG feedback on spreadsheet RAG: https://www.reddit.com/r/Rag/comments/1tagfi5/comment/opuih9c/?context=1
Our
render_textemitted the Excel column-letter row (| A | B | C |) as the first / header row of every block's Markdown table. Any downstream "find the real header" step — a Markdown parser, pandas, or a header-scoring heuristic — then latched onto the coordinate letters instead of the real column names (Name,Amount,Date). The letters were added to give an agent cell coordinates, but they only provided column letters (no row numbers), so they couldn't even form a full A1 reference while actively breaking header detection.Separately, hidden rows / columns / sheets were being dropped from the rendered output — silently losing data a user may need, with no signal that anything was hidden.
What changed
1. Real headers, with coordinates moved off the header row
TABLE/ASSUMPTIONS_TABLE, mirroringHtmlRenderer's<thead>).cols:map on the bracket line; arowgutter carries each row's Excel number.cols:(name -> letter) + the gutter (row number) — strictly more than the old letter row gave.2. Read & flag hidden content instead of dropping it
[hidden]inline (text: gutter for rows,cols:map for cols) anddata-hidden="true"in HTML.[hidden sheet](text) /data-sheet-hidden="true"(HTML).sheet_hidden,hidden_rows,hidden_cols— and exported throughParseResult.to_json().How it was tested
tests/test_rendering.py(real-header rendering, hidden row/col extraction + metadata, hidden-sheet metadata + marker, HTML hidden inclusion). Verified each fails without the implementation (source stashed -> 6 red) and passes with it (6 green), so the tests genuinely pin the behavior rather than passing vacuously.tests/, default markers). End-to-end pipeline files (test_pipeline,test_multi_table_layout,test_formula_handling,test_array_formula_rendering,test_formula_uncached_rendering): 157 passing.Not in this PR / follow-ups
render_text, and the SpreadsheetBench corpus isn't local (make corpus-download+make bench-retrieval, ~40 min). Value-recall is structurally safe-to-better (no values removed; hidden ones added); the only open risk is embedding-rank drift from the added gutter /cols:/[hidden]tokens. Recommend running before merge.cols:map usesA=Nameentries separated by,— a header literally containing a comma or=could be mis-split by a strict parser. Noted; can harden (quote names) if needed.🤖 Generated with Claude Code