Render real table headers (not column letters) + read/flag hidden cells, columns & sheets by arnav2 · Pull Request #13 · knowledgestack/excel-parser

arnav2 · 2026-06-05T05:37:50Z

Motivation

Surfaced from r/RAG feedback on spreadsheet RAG: https://www.reddit.com/r/Rag/comments/1tagfi5/comment/opuih9c/?context=1

Our render_text emitted the Excel column-letter row (| A | B | C |) as the first / header row of every block's Markdown table. Any downstream "find the real header" step — a Markdown parser, pandas, or a header-scoring heuristic — then latched onto the coordinate letters instead of the real column names (Name, Amount, Date). The letters were added to give an agent cell coordinates, but they only provided column letters (no row numbers), so they couldn't even form a full A1 reference while actively breaking header detection.

Separately, hidden rows / columns / sheets were being dropped from the rendered output — silently losing data a user may need, with no signal that anything was hidden.

What changed

1. Real headers, with coordinates moved off the header row

The Markdown grid is now a standard table whose header row is the block's real column names (for TABLE / ASSUMPTIONS_TABLE, mirroring HtmlRenderer's <thead>).
Excel column letters move to a cols: map on the bracket line; a row gutter carries each row's Excel number.
An agent reconstructs a full A1 ref from cols: (name -> letter) + the gutter (row number) — strictly more than the old letter row gave.

Before                                  After
[Sheet1!A1:C4] (table)                  [Sheet1!A1:C4] (table) cols: A=Name, B=Amount, C=Date
| A    | B      | C    |                | row | Name   | Amount | Date    |
|------|--------|------|                |-----|--------|--------|---------|
| Name | Amount | Date |  <- real       | 2   | Item 1 | 200    | 2024-02 |
| ...  | ...    | ...  |     header lost | 3   | Item 2 | 300    | 2024-03 |

2. Read & flag hidden content instead of dropping it

Hidden rows/cols are rendered: [hidden] inline (text: gutter for rows, cols: map for cols) and data-hidden="true" in HTML.
Hidden worksheets are still chunked, marked [hidden sheet] (text) / data-sheet-hidden="true" (HTML).
All hidden state is stored as structured chunk metadata — sheet_hidden, hidden_rows, hidden_cols — and exported through ParseResult.to_json().

[Visible!A1:C4] (table) cols: A=Name, B=Secret [hidden], C=Date
| row        | Name   | Secret | Date    |
| 3 [hidden] | Item 2 | 300    | 2024-03 |

chunk.metadata == {"hidden_rows": [3], "hidden_cols": ["B"]}

How it was tested

TDD red -> green. 6 new tests in tests/test_rendering.py (real-header rendering, hidden row/col extraction + metadata, hidden-sheet metadata + marker, HTML hidden inclusion). Verified each fails without the implementation (source stashed -> 6 red) and passes with it (6 green), so the tests genuinely pin the behavior rather than passing vacuously.
Full suite: 1077 passing, 0 failures (tests/, default markers). End-to-end pipeline files (test_pipeline, test_multi_table_layout, test_formula_handling, test_array_formula_rendering, test_formula_uncached_rendering): 157 passing.
Ruff: clean on all changed files.

Not in this PR / follow-ups

Retrieval recall not yet measured. This changes the default render_text, and the SpreadsheetBench corpus isn't local (make corpus-download + make bench-retrieval, ~40 min). Value-recall is structurally safe-to-better (no values removed; hidden ones added); the only open risk is embedding-rank drift from the added gutter / cols: / [hidden] tokens. Recommend running before merge.
The cols: map uses A=Name entries separated by , — a header literally containing a comma or = could be mis-split by a strict parser. Noted; can harden (quote names) if needed.

🤖 Generated with Claude Code

render_text is now a standard Markdown table whose header row holds the real column names instead of Excel column letters (A|B|C). Column letters move to a `cols:` map on the bracket line and a `row` gutter is added, so an agent can still rebuild a full A1 ref (column name + row number -> B3) without the letters masquerading as headers and defeating downstream header detection. Hidden content is now read and flagged instead of dropped: - hidden rows/cols render inline ([hidden]) in text and carry data-hidden in HTML (previously omitted entirely) - hidden worksheets keep being chunked, marked [hidden sheet] / data-sheet-hidden - all hidden state is stored as structured chunk metadata (sheet_hidden / hidden_rows / hidden_cols) and exported via to_json() Motivation: r/RAG feedback that the column-letter header row defeats real-header detection in spreadsheet RAG pipelines. Tested via TDD (red->green) with 6 new tests; full suite 1077 passing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

arnav2 merged commit 27b7004 into main Jun 5, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Render real table headers (not column letters) + read/flag hidden cells, columns & sheets#13

Render real table headers (not column letters) + read/flag hidden cells, columns & sheets#13
arnav2 merged 1 commit into
mainfrom
arnav2/renderer-header-columns

arnav2 commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

arnav2 commented Jun 5, 2026

Motivation

What changed

How it was tested

Not in this PR / follow-ups

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant