Conversation
- URL-to-structure pipeline: Chrome headless → PDF → go-pdfium font analysis → section tree
- pdftotext integration for clean text extraction with go-pdfium fallback
- `-o` flag to export any document as structured YAML
- Heading detection via font size histogram (adapts to each document)
- Chrome auto-detection on macOS/Linux/Windows with CHROME_PATH override

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
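The histogram idea in the last two bullets can be sketched as follows: take the most frequent font size on the page as the body size, then treat any line set noticeably larger as a heading candidate. This is a minimal illustration only; `bodyFontSize`, `isHeadingSize`, and the 1.15 margin are assumptions, not docmap's actual implementation.

```go
package main

import "fmt"

// bodyFontSize returns the most common font size seen on the page.
// Because the mode is computed per document, the threshold adapts to
// each document's typography rather than using a fixed point size.
func bodyFontSize(sizes []float64) float64 {
	hist := map[float64]int{}
	for _, s := range sizes {
		hist[s]++
	}
	var best float64
	bestCount := -1
	for s, n := range hist {
		if n > bestCount {
			best, bestCount = s, n
		}
	}
	return best
}

// isHeadingSize flags sizes noticeably larger than body text.
// The 15% margin is an assumed value for illustration.
func isHeadingSize(size, body float64) bool {
	return size > body*1.15
}

func main() {
	sizes := []float64{11, 11, 11, 11, 18, 11, 14, 11}
	body := bodyFontSize(sizes)
	fmt.Println(body, isHeadingSize(18, body), isHeadingSize(11, body))
}
```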
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0675b802ef
```go
if candidateIdx < len(candidates) && len(normalized) > 0 {
	candidate := candidates[candidateIdx]
	if isHeadingMatch(normalized, candidate.normalizedText) {
		fontSize = candidate.fontSize
```
Advance heading matcher past unmatched candidates
In `buildLinesWithHeadings`, matching only against `candidates[candidateIdx]` and incrementing `candidateIdx` only on a match means one missing or extra heading can block all later matches. This happens when go-pdfium and pdftotext disagree on an early heading (common with nav or duplicated text), and then subsequent real headings are never promoted above body font size, causing URL parsing to flatten into a single large section.
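One way to apply this suggestion is to scan forward from `candidateIdx` until a match is found, so a single unmatched candidate cannot stall the cursor. A minimal sketch, assuming the names from the snippet above; `matchHeading`, the exact-equality stub for `isHeadingMatch`, and the unbounded scan-ahead policy are all assumptions (a real fix would likely bound the lookahead):

```go
package main

import "fmt"

type candidate struct {
	normalizedText string
	fontSize       float64
}

// isHeadingMatch is stubbed as exact equality for this sketch.
func isHeadingMatch(a, b string) bool { return a == b }

// matchHeading scans forward from candidateIdx so one missing or extra
// candidate cannot block every later match. It returns the matched font
// size, the index to resume from, and whether a match was found.
func matchHeading(normalized string, candidates []candidate, candidateIdx int) (float64, int, bool) {
	for i := candidateIdx; i < len(candidates); i++ {
		if isHeadingMatch(normalized, candidates[i].normalizedText) {
			return candidates[i].fontSize, i + 1, true // resume after the match
		}
	}
	return 0, candidateIdx, false // no match: leave the cursor where it was
}

func main() {
	cands := []candidate{{"nav", 10}, {"intro", 18}, {"auth", 16}}
	// "nav" never appears in the pdftotext lines; "intro" still matches.
	size, next, ok := matchHeading("intro", cands, 0)
	fmt.Println(size, next, ok)
}
```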
```go
for _, s := range doc.GetAllSections() {
	doc.TotalTokens += s.Tokens
```
Compute URL token totals without double counting
`buildSectionsFromLines` already returns a tree with cumulative section tokens (via `buildTree`), so summing every node from `doc.GetAllSections()` here counts child tokens multiple times whenever headings are nested. That inflates `TotalTokens` for URL inputs and makes the reported totals in tree/json output incorrect.
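Since each section's token count is already cumulative, summing only the top-level sections gives the correct total. A sketch under assumed types; the `Section`/`Document` field names here are illustrative, not docmap's actual structs:

```go
package main

import "fmt"

// Section.Tokens is cumulative: own tokens plus all descendants,
// as produced by a buildTree-style pass.
type Section struct {
	Tokens   int
	Children []*Section
}

type Document struct {
	Sections    []*Section // roots of the section tree
	TotalTokens int
}

// computeTotal sums only the root sections. Because each root's Tokens
// already includes its descendants, nested headings are counted once.
func (d *Document) computeTotal() {
	d.TotalTokens = 0
	for _, s := range d.Sections {
		d.TotalTokens += s.Tokens
	}
}

func main() {
	child := &Section{Tokens: 30}
	root := &Section{Tokens: 100, Children: []*Section{child}} // 70 own + 30 child
	doc := &Document{Sections: []*Section{root}}
	doc.computeTotal()
	fmt.Println(doc.TotalTokens) // 100, not the double-counted 130
}
```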
Try parsing semantic HTML headings (h1-h6) before falling back to the Chrome/PDF pipeline. Works instantly on SSR doc sites (Mintlify, Docusaurus, etc.) with perfect heading detection. Chrome/PDF remains as the fallback for JS-only SPAs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update: HTML-first approach added

Pushed a new approach that tries semantic HTML parsing before falling back to Chrome/PDF. Instead of rendering the page to PDF and inferring headings from font sizes, we just fetch the raw HTML and parse the h1-h6 elements directly.

Results so far
Why it works

Most documentation sites server-side render for SEO (Mintlify, Docusaurus, GitBook, MkDocs, VitePress, Hugo, Jekyll, ReadTheDocs). The heading hierarchy is right there in the HTML: no Chrome, no PDF, no font analysis needed.

What's next
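The HTML-first idea can be sketched as a scan for h1-h6 elements over the fetched page. This is a deliberately naive regexp sketch for illustration; a real implementation would walk the DOM with golang.org/x/net/html, and `extractHeadings` and the `heading` struct are assumed names, not docmap's API:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// headingRe is deliberately naive (regexps are a poor fit for real-world
// HTML); it exists only to show the shape of the HTML-first pass.
var (
	headingRe = regexp.MustCompile(`(?is)<h([1-6])[^>]*>(.*?)</h[1-6]\s*>`)
	tagRe     = regexp.MustCompile(`<[^>]+>`)
)

type heading struct {
	Level int
	Text  string
}

// extractHeadings pulls h1-h6 levels and text out of raw HTML,
// stripping nested tags such as anchor links inside headings.
func extractHeadings(page string) []heading {
	var out []heading
	for _, m := range headingRe.FindAllStringSubmatch(page, -1) {
		level := int(m[1][0] - '0')
		text := strings.TrimSpace(tagRe.ReplaceAllString(m[2], ""))
		out = append(out, heading{Level: level, Text: text})
	}
	return out
}

func main() {
	page := `<h1>Reference</h1><p>intro</p><h2><a href="#auth">Auth</a></h2>`
	for _, h := range extractHeadings(page) {
		fmt.Printf("h%d: %s\n", h.Level, h.Text)
	}
}
```

If the page yields no headings (a JS-only SPA), the result is empty and the Chrome/PDF pipeline takes over as the fallback.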
Summary

- `docmap https://example.com/docs` renders via headless Chrome, extracts heading hierarchy from font sizes using go-pdfium (WASM), and displays the same tree output as local files
- `-o file.yaml` saves any document's structure for fast local reuse (`docmap file.yaml --search "auth"`)
- Uses `pdftotext` for clean text when available, falls back to go-pdfium rect-based extraction

Status

Working end-to-end, but heading detection from URLs needs refinement: the font-size-to-text matching between pdftotext output and go-pdfium structured data is a known area for improvement. Local file features (`-o` flag, YAML export) are solid.

Test plan

- `go test ./...` passes
- `go build` succeeds
- `docmap https://docs.discord.com/developers/reference` renders structure
- `docmap README.md -o out.yaml && docmap out.yaml` round-trips
- `docmap https://example.com -o saved.yaml && docmap saved.yaml --search "term"` works

🤖 Generated with Claude Code