Skip to content

Markdown parser fails on common edge cases #245

@saccharin98

Description

@saccharin98

Description

The current MarkdownParser in pageindex/parser/markdown.py has several robustness issues. It silently drops content or produces incorrect results on valid Markdown inputs that are common in real-world documents.

All cases below have been verified against the old parser — 6 out of 7 confirmed failing.

Failing Test Cases

1. Content before the first header is silently dropped

This is an introduction paragraph.
Some important context here.

# Chapter 1
Body text.

Expected: The preamble text should be captured as a node.
Actual: _build_nodes only iterates over headers, so everything before # Chapter 1 is lost. (1 node returned, preamble missing)

2. Headerless Markdown files produce zero nodes

Just some plain text document
with multiple lines of content.
No headers at all.

Expected: At least one node containing the full content.
Actual: _extract_headers returns [], _build_nodes returns [] — the entire document is silently discarded.

3. YAML frontmatter — not a separate bug

---
title: My Document
author: Alice
---

# Introduction
Hello world.

Verified: The frontmatter does NOT leak into node content — it is silently dropped as part of the preamble (same root cause as Case 1). However, the new parser now properly extracts frontmatter into metadata, which is the correct behavior.

4. Setext-style headers are not recognized

Main Title
==========

Some content here.

Sub Section
-----------

More content.

Expected: Two nodes — H1 "Main Title" and H2 "Sub Section".
Actual: Zero headers detected, zero nodes returned. Entire document lost.

5. Tilde code fences are not tracked

# Header

~~~python
# This is a comment, not a header
def foo():
    pass
~~~

# Real Header

Expected: 2 nodes: "Header" and "Real Header".
Actual: 3 nodes — the comment # This is a comment, not a header inside the tilde fence is incorrectly detected as a header.

6. Mismatched fence lengths cause incorrect header detection

Example input (a 4-backtick fence containing 3-backtick lines):

# Before

````python
# Inside 4-backtick fence
```
# Still inside — 3 backticks can't close a 4-backtick fence
```
````

# After

Expected: 2 headers: "Before" and "After".
Actual: 3 headers detected: "Before", "Still inside", "After". The parser uses a simple toggle on any ``` line, so it incorrectly "closes" the 4-backtick fence at the first 3-backtick line.

7. ATX closing hashes are included in title

## My Section ##

Content here.

Expected: Title is "My Section".
Actual: Title is "My Section ##" — the trailing ## is not stripped.

Reproduction Script

Self-contained script that inlines the old parser and runs all 7 cases. Save as test_old_parser.py and run with python test_old_parser.py.

Click to expand test script
"""Test the OLD markdown parser against the 7 edge cases from issue #245."""
import re
import sys
import tempfile
from pathlib import Path
from dataclasses import dataclass


# --- Inline the old parser so we don't need to mess with imports ---

@dataclass
class ContentNode:
    content: str
    tokens: int
    title: str | None = None
    index: int | None = None
    level: int | None = None
    images: list | None = None

@dataclass
class ParsedDocument:
    doc_name: str
    nodes: list
    metadata: dict | None = None

def count_tokens(text, model=None):
    return len(text.split())  # cheap approximation for testing

class OldMarkdownParser:
    """Exact copy of the parser BEFORE the fix (commit 6d29886)."""
    def supported_extensions(self):
        return [".md", ".markdown"]

    def parse(self, file_path, **kwargs):
        path = Path(file_path)
        model = kwargs.get("model")
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        lines = content.split("\n")
        headers = self._extract_headers(lines)
        nodes = self._build_nodes(headers, lines, model)
        return ParsedDocument(doc_name=path.stem, nodes=nodes)

    def _extract_headers(self, lines):
        header_pattern = r"^(#{1,6})\s+(.+)$"
        code_block_pattern = r"^```"
        headers = []
        in_code_block = False
        for line_num, line in enumerate(lines, 1):
            stripped = line.strip()
            if re.match(code_block_pattern, stripped):
                in_code_block = not in_code_block
                continue
            if not in_code_block and stripped:
                match = re.match(header_pattern, stripped)
                if match:
                    headers.append({
                        "title": match.group(2).strip(),
                        "level": len(match.group(1)),
                        "line_num": line_num,
                    })
        return headers

    def _build_nodes(self, headers, lines, model):
        nodes = []
        for i, header in enumerate(headers):
            start = header["line_num"] - 1
            end = headers[i + 1]["line_num"] - 1 if i + 1 < len(headers) else len(lines)
            text = "\n".join(lines[start:end]).strip()
            tokens = count_tokens(text, model=model)
            nodes.append(ContentNode(
                content=text,
                tokens=tokens,
                title=header["title"],
                index=header["line_num"],
                level=header["level"],
            ))
        return nodes


# --- Helper ---
def parse_string(parser, content, name="test"):
    with tempfile.NamedTemporaryFile(mode="w", suffix=".md", delete=False) as f:
        f.write(content)
        f.flush()
        return parser.parse(f.name)


# --- Tests ---
parser = OldMarkdownParser()
failures = []
passes = []

# Case 1: Preamble before first header
result = parse_string(parser, """This is an introduction paragraph.
Some important context here.

# Chapter 1
Body text.
""")
preamble_found = any("introduction" in n.content for n in result.nodes)
if preamble_found:
    passes.append("Case 1: Preamble")
else:
    failures.append(
        "Case 1: Preamble before first header -> DROPPED "
        "(got {} nodes, none contain 'introduction')".format(len(result.nodes))
    )

# Case 2: Headerless file
result = parse_string(parser, """Just some plain text document
with multiple lines of content.
No headers at all.
""")
if len(result.nodes) > 0:
    passes.append("Case 2: Headerless file")
else:
    failures.append("Case 2: Headerless file -> 0 nodes returned, entire document LOST")

# Case 3: YAML frontmatter
result = parse_string(parser, """---
title: My Document
author: Alice
tags: [python, markdown]
---

# Introduction
Hello world.
""")
frontmatter_in_content = any("title: My Document" in n.content for n in result.nodes)
if not frontmatter_in_content:
    passes.append("Case 3: Frontmatter (not a separate bug, dropped with preamble)")
else:
    failures.append("Case 3: YAML frontmatter -> leaked into node content")

# Case 4: Setext headers
result = parse_string(parser, """Main Title
==========

Some content here.

Sub Section
-----------

More content.
""")
titles = [n.title for n in result.nodes if n.title]
if "Main Title" in titles and "Sub Section" in titles:
    passes.append("Case 4: Setext headers")
else:
    failures.append(
        "Case 4: Setext headers -> not recognized "
        "(got titles: {}, {} nodes)".format(titles, len(result.nodes))
    )

# Case 5: Tilde fences
result = parse_string(parser, """# Header

~~~python
# This is a comment, not a header
def foo():
    pass
~~~

# Real Header
""")
titles = [n.title for n in result.nodes if n.title]
if "This is a comment, not a header" not in titles:
    passes.append("Case 5: Tilde fences")
else:
    failures.append(
        "Case 5: Tilde fences -> comment inside ~~~ detected as header "
        "(got titles: {})".format(titles)
    )

# Case 6: Mismatched fence lengths
content = """# Before

````python
# Inside 4-backtick fence
```
# Still inside
```
````

# After
"""
result = parse_string(parser, content)
titles = [n.title for n in result.nodes if n.title]
has_fake = "Inside 4-backtick fence" in titles or "Still inside" in titles
if not has_fake:
    passes.append("Case 6: Mismatched fence lengths")
else:
    failures.append(
        "Case 6: Mismatched fence lengths -> fake headers detected "
        "inside code block (got titles: {})".format(titles)
    )

# Case 7: ATX closing hashes
result = parse_string(parser, """## My Section ##

Content here.
""")
if result.nodes and result.nodes[0].title == "My Section":
    passes.append("Case 7: ATX closing hashes")
else:
    actual_title = result.nodes[0].title if result.nodes else "N/A"
    failures.append(
        "Case 7: ATX closing hashes -> title is '{}' "
        "instead of 'My Section'".format(actual_title)
    )

# --- Report ---
print("=" * 60)
print("OLD PARSER EDGE CASE VERIFICATION")
print("=" * 60)
print()
for p in passes:
    print(f"  PASS  {p}")
for f in failures:
    print(f"  FAIL  {f}")
print()
print(f"Result: {len(failures)} failed, {len(passes)} passed out of 7")
sys.exit(1 if failures else 0)

Output:

============================================================
OLD PARSER EDGE CASE VERIFICATION
============================================================

  PASS  Case 3: Frontmatter (not a separate bug, dropped with preamble)
  FAIL  Case 1: Preamble before first header -> DROPPED (got 1 nodes, none contain 'introduction')
  FAIL  Case 2: Headerless file -> 0 nodes returned, entire document LOST
  FAIL  Case 4: Setext headers -> not recognized (got titles: [], 0 nodes)
  FAIL  Case 5: Tilde fences -> comment inside ~~~ detected as header (got titles: ['Header', 'This is a comment, not a header', 'Real Header'])
  FAIL  Case 6: Mismatched fence lengths -> fake headers detected inside code block (got titles: ['Before', 'Still inside', 'After'])
  FAIL  Case 7: ATX closing hashes -> title is 'My Section ##' instead of 'My Section'

Result: 6 failed, 1 passed out of 7

Impact

These issues affect real-world Markdown documents:

  • README files and documentation commonly have preamble text before the first header
  • Technical docs use setext headers and various code fence styles
  • Some Markdown editors add closing hashes to ATX headers
  • A headerless .md file being completely discarded is a data loss bug

Environment

  • pageindex version: 0.3.0.dev1
  • Python: 3.11

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions