
Add URL support and YAML export#2

Open
JordanCoin wants to merge 2 commits into main from feat/url-support

Conversation

@JordanCoin
Owner

Summary

  • URL-to-structure pipeline: `docmap https://example.com/docs` renders via headless Chrome, extracts the heading hierarchy from font sizes using go-pdfium (WASM), and displays the same tree output as local files
  • YAML export: `-o file.yaml` saves any document's structure for fast local reuse (`docmap file.yaml --search "auth"`)
  • pdftotext integration: uses poppler's pdftotext for clean text when available, falling back to go-pdfium rect-based extraction
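The pdftotext-with-fallback behavior can be sketched roughly as below. This is a minimal illustration, not the PR's actual code: `extractText` and the injected `extractWithPdfium` callback are hypothetical names standing in for the real extraction paths.

```go
package main

import (
	"fmt"
	"os/exec"
)

// extractText tries poppler's pdftotext binary first and falls back to
// an in-process extractor when the binary is missing or fails.
// extractWithPdfium is a stand-in for the go-pdfium rect-based path.
func extractText(pdfPath string, extractWithPdfium func(string) (string, error)) (string, error) {
	if _, err := exec.LookPath("pdftotext"); err == nil {
		// "-layout" preserves column layout; "-" writes the text to stdout.
		out, err := exec.Command("pdftotext", "-layout", pdfPath, "-").Output()
		if err == nil {
			return string(out), nil
		}
		// pdftotext exists but failed on this file: fall through.
	}
	return extractWithPdfium(pdfPath)
}

func main() {
	// With a missing file (or no pdftotext installed), the fallback fires.
	text, err := extractText("nonexistent.pdf", func(p string) (string, error) {
		return "fallback text for " + p, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
}
```

Shelling out only when `exec.LookPath` succeeds keeps the binary optional, which matches the "when available" wording above.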

Status

Working end-to-end but heading detection from URLs needs refinement — the font-size-to-text matching between pdftotext output and go-pdfium structured data is a known area for improvement. Local file features (-o flag, YAML export) are solid.

Test plan

  • `go test ./...` passes
  • `go build` succeeds
  • `docmap https://docs.discord.com/developers/reference` renders structure
  • `docmap README.md -o out.yaml && docmap out.yaml` round-trips
  • `docmap https://example.com -o saved.yaml && docmap saved.yaml --search "term"` works

🤖 Generated with Claude Code

- URL-to-structure pipeline: Chrome headless → PDF → go-pdfium font analysis → section tree
- pdftotext integration for clean text extraction with go-pdfium fallback
- `-o` flag to export any document as structured YAML
- Heading detection via font size histogram (adapts to each document)
- Chrome auto-detection on macOS/Linux/Windows with CHROME_PATH override
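The font-size histogram idea from the bullets above can be sketched as follows. This is an illustrative reimplementation under assumed types (`Line`, `bodyFontSize`, `headingLevels` are invented names, not the PR's): the most frequent size is treated as body text, and each distinct larger size maps to a heading level.

```go
package main

import (
	"fmt"
	"sort"
)

// Line is a minimal stand-in for an extracted text line.
type Line struct {
	Text     string
	FontSize float64
}

// bodyFontSize returns the most frequent font size, which a
// histogram-based detector treats as body text; ties prefer the
// smaller size.
func bodyFontSize(lines []Line) float64 {
	hist := map[float64]int{}
	for _, l := range lines {
		hist[l.FontSize]++
	}
	var body float64
	best := -1
	for size, n := range hist {
		if n > best || (n == best && size < body) {
			best, body = n, size
		}
	}
	return body
}

// headingLevels maps each distinct size above body text to a heading
// level: the largest size becomes level 1, the next level 2, and so on.
// This is what lets the detector adapt to each document.
func headingLevels(lines []Line) map[float64]int {
	body := bodyFontSize(lines)
	seen := map[float64]bool{}
	for _, l := range lines {
		if l.FontSize > body {
			seen[l.FontSize] = true
		}
	}
	ordered := make([]float64, 0, len(seen))
	for s := range seen {
		ordered = append(ordered, s)
	}
	sort.Sort(sort.Reverse(sort.Float64Slice(ordered)))
	levels := map[float64]int{}
	for i, s := range ordered {
		levels[s] = i + 1
	}
	return levels
}

func main() {
	lines := []Line{
		{"Title", 24}, {"Intro", 10}, {"Section", 16},
		{"body", 10}, {"body", 10}, {"Subsection", 13},
	}
	fmt.Println(headingLevels(lines)) // 24→level 1, 16→level 2, 13→level 3
}
```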

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0675b802ef


Comment on lines +256 to +259

```go
if candidateIdx < len(candidates) && len(normalized) > 0 {
	candidate := candidates[candidateIdx]
	if isHeadingMatch(normalized, candidate.normalizedText) {
		fontSize = candidate.fontSize
```

P1: Advance heading matcher past unmatched candidates

In `buildLinesWithHeadings`, matching only against `candidates[candidateIdx]` and incrementing `candidateIdx` only on a match means one missing/extra heading can block all later matches. This happens when go-pdfium and pdftotext disagree on an early heading (common with nav or duplicated text), and then subsequent real headings are never promoted above body font size, causing URL parsing to flatten into a single large section.
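One possible fix is to scan a small lookahead window instead of comparing only the current candidate, so a single unmatched candidate cannot wedge the matcher. A minimal sketch, with `Candidate`, `isHeadingMatch`, and the window size all assumed for illustration:

```go
package main

import "fmt"

// Candidate is an assumed stand-in for the go-pdfium heading
// candidates referenced in the snippet above.
type Candidate struct {
	normalizedText string
	fontSize       float64
}

// isHeadingMatch stands in for the PR's matcher; plain equality here.
func isHeadingMatch(a, b string) bool { return a == b }

// matchHeading scans up to `lookahead` candidates past candidateIdx,
// so one missing/extra candidate skips forward instead of blocking
// every later match. It returns the matched font size, the advanced
// index, and whether a match was found.
func matchHeading(normalized string, candidates []Candidate, candidateIdx int) (float64, int, bool) {
	const lookahead = 3 // tolerate a few unmatched candidates
	end := candidateIdx + lookahead
	if end > len(candidates) {
		end = len(candidates)
	}
	for j := candidateIdx; j < end; j++ {
		if isHeadingMatch(normalized, candidates[j].normalizedText) {
			return candidates[j].fontSize, j + 1, true
		}
	}
	return 0, candidateIdx, false // no match: leave the index untouched
}

func main() {
	cands := []Candidate{{"nav junk", 12}, {"real heading", 18}}
	size, next, ok := matchHeading("real heading", cands, 0)
	fmt.Println(size, next, ok) // the unmatched nav candidate is skipped
}
```

Leaving the index untouched on a miss (rather than advancing it) keeps a body-text line from consuming a heading candidate.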


Comment on lines +79 to +80

```go
for _, s := range doc.GetAllSections() {
	doc.TotalTokens += s.Tokens
```

P2: Compute URL token totals without double counting

`buildSectionsFromLines` already returns a tree with cumulative section tokens (via `buildTree`), so summing every node from `doc.GetAllSections()` here counts child tokens multiple times whenever headings are nested. That inflates `TotalTokens` for URL inputs and makes the reported totals in tree/json output incorrect.
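Since each section's token count is already cumulative, summing only the top-level sections counts every token exactly once. A sketch of the idea, with an assumed `Section` type rather than the PR's real one:

```go
package main

import "fmt"

// Section is an assumed stand-in for a node in the document tree.
// Tokens is cumulative: the node's own text plus all descendants,
// matching the review's description of buildTree.
type Section struct {
	Tokens   int
	Children []*Section
}

// totalTokens sums only the root sections. Because each root's Tokens
// already includes its descendants, nested headings are not counted
// twice, unlike summing over every node in the tree.
func totalTokens(roots []*Section) int {
	sum := 0
	for _, s := range roots {
		sum += s.Tokens
	}
	return sum
}

func main() {
	child := &Section{Tokens: 40}
	root := &Section{Tokens: 100, Children: []*Section{child}} // 60 own + 40 child
	fmt.Println(totalTokens([]*Section{root})) // 100, not the inflated 140
}
```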


Try parsing semantic HTML headings (`<h1>`-`<h6>`) before falling back to the Chrome/PDF pipeline. Works instantly on SSR doc sites (Mintlify, Docusaurus, etc.) with perfect heading detection. Chrome/PDF remains as a fallback for JS-only SPAs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JordanCoin
Owner Author

Update: HTML-first approach added

Pushed a new approach that tries semantic HTML parsing before falling back to Chrome/PDF. Instead of rendering the page to PDF and inferring headings from font sizes, we just fetch the raw HTML and parse the `<h1>`-`<h6>` tags directly.

Results so far

| Site | Approach used | Result |
| --- | --- | --- |
| Discord docs (Mintlify) | HTML | Perfect structure, instant |
| OpenClaw docs (Mintlify) | HTML | Perfect structure, instant |
| OpenAI docs (Next.js) | HTML | Clean structure, instant |
| Anthropic docs (custom React) | HTML | Partial: nav headings leak through, some empty heading text |

Why it works

Most documentation sites server-side render for SEO (Mintlify, Docusaurus, GitBook, MkDocs, VitePress, Hugo, Jekyll, ReadTheDocs). The heading hierarchy is right there in the HTML — no Chrome, no PDF, no font analysis needed.
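The HTML path can be sketched as below. This is a deliberately simplified illustration using a regex; production code should use a real HTML parser (e.g. golang.org/x/net/html), which handles nesting, attributes, and entities correctly. `extractHeadings` and the `Heading` type are invented names for this sketch, not the PR's code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// headingRe matches <h1>..</h1> through <h6>..</h6>, case-insensitive,
// across newlines. A regex is fine for a sketch but not for real HTML.
var headingRe = regexp.MustCompile(`(?is)<h([1-6])[^>]*>(.*?)</h[1-6]>`)

// tagRe strips any nested inline tags (<a>, <span>, ...) from heading text.
var tagRe = regexp.MustCompile(`(?s)<[^>]+>`)

type Heading struct {
	Level int
	Text  string
}

// extractHeadings pulls h1-h6 headings, in document order, straight
// from server-rendered HTML: no Chrome, no PDF, no font analysis.
func extractHeadings(htmlSrc string) []Heading {
	var out []Heading
	for _, m := range headingRe.FindAllStringSubmatch(htmlSrc, -1) {
		level, _ := strconv.Atoi(m[1])
		text := strings.TrimSpace(tagRe.ReplaceAllString(m[2], ""))
		if text != "" { // skip empty heading text, like the Anthropic-docs case
			out = append(out, Heading{level, text})
		}
	}
	return out
}

func main() {
	page := `<html><body><h1>Reference</h1><p>intro</p><h2 id="auth">Authentication</h2></body></html>`
	fmt.Println(extractHeadings(page))
}
```

Filtering empty heading text is a first cut at the nav-leakage problem noted in the table above; tighter filtering would also need to exclude headings inside `<nav>`/`<header>` containers.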

What's next

  • The HTML path handles ~90%+ of doc sites perfectly with zero dependencies and sub-second response
  • Chrome/PDF fallback still exists for JS-only SPAs
  • Sites with complex layouts (Anthropic's custom app) need either tighter filtering or a hybrid approach
  • Feels like we're close to something really solid — still iterating
