Skip to content

Render real table headers (not column letters) + read/flag hidden cells, columns & sheets#13

Merged
arnav2 merged 1 commit into
mainfrom
arnav2/renderer-header-columns
Jun 5, 2026
Merged

Render real table headers (not column letters) + read/flag hidden cells, columns & sheets#13
arnav2 merged 1 commit into
mainfrom
arnav2/renderer-header-columns

Conversation

@arnav2
Copy link
Copy Markdown
Collaborator

@arnav2 arnav2 commented Jun 5, 2026

Motivation

Surfaced from r/RAG feedback on spreadsheet RAG: https://www.reddit.com/r/Rag/comments/1tagfi5/comment/opuih9c/?context=1

Our render_text emitted the Excel column-letter row (| A | B | C |) as the first / header row of every block's Markdown table. Any downstream "find the real header" step — a Markdown parser, pandas, or a header-scoring heuristic — then latched onto the coordinate letters instead of the real column names (Name, Amount, Date). The letters were added to give an agent cell coordinates, but they only provided column letters (no row numbers), so they couldn't even form a full A1 reference while actively breaking header detection.

Separately, hidden rows / columns / sheets were being dropped from the rendered output — silently losing data a user may need, with no signal that anything was hidden.

What changed

1. Real headers, with coordinates moved off the header row

  • The Markdown grid is now a standard table whose header row is the block's real column names (for TABLE / ASSUMPTIONS_TABLE, mirroring HtmlRenderer's <thead>).
  • Excel column letters move to a cols: map on the bracket line; a row gutter carries each row's Excel number.
  • An agent reconstructs a full A1 ref from cols: (name -> letter) + the gutter (row number) — strictly more than the old letter row gave.
Before                                  After
[Sheet1!A1:C4] (table)                  [Sheet1!A1:C4] (table) cols: A=Name, B=Amount, C=Date
| A    | B      | C    |                | row | Name   | Amount | Date    |
|------|--------|------|                |-----|--------|--------|---------|
| Name | Amount | Date |  <- real       | 2   | Item 1 | 200    | 2024-02 |
| ...  | ...    | ...  |     header lost | 3   | Item 2 | 300    | 2024-03 |

2. Read & flag hidden content instead of dropping it

  • Hidden rows/cols are rendered: [hidden] inline (text: gutter for rows, cols: map for cols) and data-hidden="true" in HTML.
  • Hidden worksheets are still chunked, marked [hidden sheet] (text) / data-sheet-hidden="true" (HTML).
  • All hidden state is stored as structured chunk metadatasheet_hidden, hidden_rows, hidden_cols — and exported through ParseResult.to_json().
[Visible!A1:C4] (table) cols: A=Name, B=Secret [hidden], C=Date
| row        | Name   | Secret | Date    |
| 3 [hidden] | Item 2 | 300    | 2024-03 |

chunk.metadata == {"hidden_rows": [3], "hidden_cols": ["B"]}

How it was tested

  • TDD red -> green. 6 new tests in tests/test_rendering.py (real-header rendering, hidden row/col extraction + metadata, hidden-sheet metadata + marker, HTML hidden inclusion). Verified each fails without the implementation (source stashed -> 6 red) and passes with it (6 green), so the tests genuinely pin the behavior rather than passing vacuously.
  • Full suite: 1077 passing, 0 failures (tests/, default markers). End-to-end pipeline files (test_pipeline, test_multi_table_layout, test_formula_handling, test_array_formula_rendering, test_formula_uncached_rendering): 157 passing.
  • Ruff: clean on all changed files.

Not in this PR / follow-ups

  • Retrieval recall not yet measured. This changes the default render_text, and the SpreadsheetBench corpus isn't local (make corpus-download + make bench-retrieval, ~40 min). Value-recall is structurally safe-to-better (no values removed; hidden ones added); the only open risk is embedding-rank drift from the added gutter / cols: / [hidden] tokens. Recommend running before merge.
  • The cols: map uses A=Name entries separated by , — a header literally containing a comma or = could be mis-split by a strict parser. Noted; can harden (quote names) if needed.

🤖 Generated with Claude Code

render_text is now a standard Markdown table whose header row holds the
real column names instead of Excel column letters (A|B|C). Column letters
move to a `cols:` map on the bracket line and a `row` gutter is added, so
an agent can still rebuild a full A1 ref (column name + row number -> B3)
without the letters masquerading as headers and defeating downstream
header detection.

Hidden content is now read and flagged instead of dropped:
- hidden rows/cols render inline ([hidden]) in text and carry data-hidden
  in HTML (previously omitted entirely)
- hidden worksheets keep being chunked, marked [hidden sheet] /
  data-sheet-hidden
- all hidden state is stored as structured chunk metadata (sheet_hidden /
  hidden_rows / hidden_cols) and exported via to_json()

Motivation: r/RAG feedback that the column-letter header row defeats
real-header detection in spreadsheet RAG pipelines.

Tested via TDD (red->green) with 6 new tests; full suite 1077 passing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@arnav2 arnav2 merged commit 27b7004 into main Jun 5, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant