
Add URL support and YAML export#2

Open
JordanCoin wants to merge 2 commits into main from feat/url-support

Conversation

@JordanCoin
Owner

Summary

  • URL-to-structure pipeline: `docmap https://example.com/docs` renders via headless Chrome, extracts the heading hierarchy from font sizes using go-pdfium (WASM), and displays the same tree output as local files
  • YAML export: `-o file.yaml` saves any document's structure for fast local reuse (`docmap file.yaml --search "auth"`)
  • pdftotext integration: uses poppler's pdftotext for clean text when available, falling back to go-pdfium rect-based extraction
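The pdftotext-with-fallback behavior can be sketched roughly as below. This is a minimal illustration, not the PR's actual code: `extractText` and the injected `extractWithPdfium` callback are hypothetical names standing in for the real extraction paths.

```go
package main

import (
	"fmt"
	"os/exec"
)

// extractText tries poppler's pdftotext binary first and falls back to
// an in-process extractor when the binary is missing or fails.
// extractWithPdfium is a stand-in for the go-pdfium rect-based path.
func extractText(pdfPath string, extractWithPdfium func(string) (string, error)) (string, error) {
	if _, err := exec.LookPath("pdftotext"); err == nil {
		// "-layout" preserves column layout; "-" writes the text to stdout.
		out, err := exec.Command("pdftotext", "-layout", pdfPath, "-").Output()
		if err == nil {
			return string(out), nil
		}
		// pdftotext exists but failed on this file: fall through.
	}
	return extractWithPdfium(pdfPath)
}

func main() {
	// With a missing file (or no pdftotext installed), the fallback fires.
	text, err := extractText("nonexistent.pdf", func(p string) (string, error) {
		return "fallback text for " + p, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
}
```

Shelling out only when `exec.LookPath` succeeds keeps the binary optional, which matches the "when available" wording above.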

Status

Working end-to-end but heading detection from URLs needs refinement — the font-size-to-text matching between pdftotext output and go-pdfium structured data is a known area for improvement. Local file features (-o flag, YAML export) are solid.

Test plan

  • `go test ./...` passes
  • `go build` succeeds
  • `docmap https://docs.discord.com/developers/reference` renders structure
  • `docmap README.md -o out.yaml && docmap out.yaml` round-trips
  • `docmap https://example.com -o saved.yaml && docmap saved.yaml --search "term"` works

🤖 Generated with Claude Code

- URL-to-structure pipeline: Chrome headless → PDF → go-pdfium font analysis → section tree
- pdftotext integration for clean text extraction with go-pdfium fallback
- `-o` flag to export any document as structured YAML
- Heading detection via font size histogram (adapts to each document)
- Chrome auto-detection on macOS/Linux/Windows with CHROME_PATH override
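The font-size histogram idea from the bullets above can be sketched as follows. This is an illustrative reimplementation under assumed types (`Line`, `bodyFontSize`, `headingLevels` are invented names, not the PR's): the most frequent size is treated as body text, and each distinct larger size maps to a heading level.

```go
package main

import (
	"fmt"
	"sort"
)

// Line is a minimal stand-in for an extracted text line.
type Line struct {
	Text     string
	FontSize float64
}

// bodyFontSize returns the most frequent font size, which a
// histogram-based detector treats as body text; ties prefer the
// smaller size.
func bodyFontSize(lines []Line) float64 {
	hist := map[float64]int{}
	for _, l := range lines {
		hist[l.FontSize]++
	}
	var body float64
	best := -1
	for size, n := range hist {
		if n > best || (n == best && size < body) {
			best, body = n, size
		}
	}
	return body
}

// headingLevels maps each distinct size above body text to a heading
// level: the largest size becomes level 1, the next level 2, and so on.
// This is what lets the detector adapt to each document.
func headingLevels(lines []Line) map[float64]int {
	body := bodyFontSize(lines)
	seen := map[float64]bool{}
	for _, l := range lines {
		if l.FontSize > body {
			seen[l.FontSize] = true
		}
	}
	ordered := make([]float64, 0, len(seen))
	for s := range seen {
		ordered = append(ordered, s)
	}
	sort.Sort(sort.Reverse(sort.Float64Slice(ordered)))
	levels := map[float64]int{}
	for i, s := range ordered {
		levels[s] = i + 1
	}
	return levels
}

func main() {
	lines := []Line{
		{"Title", 24}, {"Intro", 10}, {"Section", 16},
		{"body", 10}, {"body", 10}, {"Subsection", 13},
	}
	fmt.Println(headingLevels(lines)) // 24→level 1, 16→level 2, 13→level 3
}
```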

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0675b802ef


Comment on lines +256 to +259

```go
if candidateIdx < len(candidates) && len(normalized) > 0 {
	candidate := candidates[candidateIdx]
	if isHeadingMatch(normalized, candidate.normalizedText) {
		fontSize = candidate.fontSize
```

P1: Advance heading matcher past unmatched candidates

In `buildLinesWithHeadings`, matching only against `candidates[candidateIdx]` and incrementing `candidateIdx` only on a match means one missing/extra heading can block all later matches. This happens when go-pdfium and pdftotext disagree on an early heading (common with nav or duplicated text), and then subsequent real headings are never promoted above body font size, causing URL parsing to flatten into a single large section.
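One possible fix is to scan a small lookahead window instead of comparing only the current candidate, so a single unmatched candidate cannot wedge the matcher. A minimal sketch, with `Candidate`, `isHeadingMatch`, and the window size all assumed for illustration:

```go
package main

import "fmt"

// Candidate is an assumed stand-in for the go-pdfium heading
// candidates referenced in the snippet above.
type Candidate struct {
	normalizedText string
	fontSize       float64
}

// isHeadingMatch stands in for the PR's matcher; plain equality here.
func isHeadingMatch(a, b string) bool { return a == b }

// matchHeading scans up to `lookahead` candidates past candidateIdx,
// so one missing/extra candidate skips forward instead of blocking
// every later match. It returns the matched font size, the advanced
// index, and whether a match was found.
func matchHeading(normalized string, candidates []Candidate, candidateIdx int) (float64, int, bool) {
	const lookahead = 3 // tolerate a few unmatched candidates
	end := candidateIdx + lookahead
	if end > len(candidates) {
		end = len(candidates)
	}
	for j := candidateIdx; j < end; j++ {
		if isHeadingMatch(normalized, candidates[j].normalizedText) {
			return candidates[j].fontSize, j + 1, true
		}
	}
	return 0, candidateIdx, false // no match: leave the index untouched
}

func main() {
	cands := []Candidate{{"nav junk", 12}, {"real heading", 18}}
	size, next, ok := matchHeading("real heading", cands, 0)
	fmt.Println(size, next, ok) // the unmatched nav candidate is skipped
}
```

Leaving the index untouched on a miss (rather than advancing it) keeps a body-text line from consuming a heading candidate.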


Comment on lines +79 to +80

```go
for _, s := range doc.GetAllSections() {
	doc.TotalTokens += s.Tokens
```

P2: Compute URL token totals without double counting

`buildSectionsFromLines` already returns a tree with cumulative section tokens (via `buildTree`), so summing every node from `doc.GetAllSections()` here counts child tokens multiple times whenever headings are nested. That inflates `TotalTokens` for URL inputs and makes the reported totals in tree/json output incorrect.
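Since each section's token count is already cumulative, summing only the top-level sections counts every token exactly once. A sketch of the idea, with an assumed `Section` type rather than the PR's real one:

```go
package main

import "fmt"

// Section is an assumed stand-in for a node in the document tree.
// Tokens is cumulative: the node's own text plus all descendants,
// matching the review's description of buildTree.
type Section struct {
	Tokens   int
	Children []*Section
}

// totalTokens sums only the root sections. Because each root's Tokens
// already includes its descendants, nested headings are not counted
// twice, unlike summing over every node in the tree.
func totalTokens(roots []*Section) int {
	sum := 0
	for _, s := range roots {
		sum += s.Tokens
	}
	return sum
}

func main() {
	child := &Section{Tokens: 40}
	root := &Section{Tokens: 100, Children: []*Section{child}} // 60 own + 40 child
	fmt.Println(totalTokens([]*Section{root})) // 100, not the inflated 140
}
```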


Try parsing semantic HTML headings (`<h1>`-`<h6>`) before falling back to the Chrome/PDF pipeline. Works instantly on SSR doc sites (Mintlify, Docusaurus, etc.) with perfect heading detection. Chrome/PDF remains as a fallback for JS-only SPAs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JordanCoin
Owner Author

Update: HTML-first approach added

Pushed a new approach that tries semantic HTML parsing before falling back to Chrome/PDF. Instead of rendering the page to PDF and inferring headings from font sizes, we just fetch the raw HTML and parse the `<h1>`-`<h6>` tags directly.

Results so far

| Site | Approach used | Result |
| --- | --- | --- |
| Discord docs (Mintlify) | HTML | Perfect structure, instant |
| OpenClaw docs (Mintlify) | HTML | Perfect structure, instant |
| OpenAI docs (Next.js) | HTML | Clean structure, instant |
| Anthropic docs (custom React) | HTML | Partial: nav headings leak through, some empty heading text |

Why it works

Most documentation sites server-side render for SEO (Mintlify, Docusaurus, GitBook, MkDocs, VitePress, Hugo, Jekyll, ReadTheDocs). The heading hierarchy is right there in the HTML — no Chrome, no PDF, no font analysis needed.
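The HTML path can be sketched as below. This is a deliberately simplified illustration using a regex; production code should use a real HTML parser (e.g. golang.org/x/net/html), which handles nesting, attributes, and entities correctly. `extractHeadings` and the `Heading` type are invented names for this sketch, not the PR's code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// headingRe matches <h1>..</h1> through <h6>..</h6>, case-insensitive,
// across newlines. A regex is fine for a sketch but not for real HTML.
var headingRe = regexp.MustCompile(`(?is)<h([1-6])[^>]*>(.*?)</h[1-6]>`)

// tagRe strips any nested inline tags (<a>, <span>, ...) from heading text.
var tagRe = regexp.MustCompile(`(?s)<[^>]+>`)

type Heading struct {
	Level int
	Text  string
}

// extractHeadings pulls h1-h6 headings, in document order, straight
// from server-rendered HTML: no Chrome, no PDF, no font analysis.
func extractHeadings(htmlSrc string) []Heading {
	var out []Heading
	for _, m := range headingRe.FindAllStringSubmatch(htmlSrc, -1) {
		level, _ := strconv.Atoi(m[1])
		text := strings.TrimSpace(tagRe.ReplaceAllString(m[2], ""))
		if text != "" { // skip empty heading text, like the Anthropic-docs case
			out = append(out, Heading{level, text})
		}
	}
	return out
}

func main() {
	page := `<html><body><h1>Reference</h1><p>intro</p><h2 id="auth">Authentication</h2></body></html>`
	fmt.Println(extractHeadings(page))
}
```

Filtering empty heading text is a first cut at the nav-leakage problem noted in the table above; tighter filtering would also need to exclude headings inside `<nav>`/`<header>` containers.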

What's next

  • The HTML path handles ~90%+ of doc sites perfectly with zero dependencies and sub-second response
  • Chrome/PDF fallback still exists for JS-only SPAs
  • Sites with complex layouts (Anthropic's custom app) need either tighter filtering or a hybrid approach
  • Feels like we're close to something really solid — still iterating
