Improve LaTeX and JS-heavy site parsing in Rust CLI #433

Merged: d-oit merged 5 commits into main from fix/cli-latex-parsing-10941850418897376774 on Jun 9, 2026

Conversation

@d-oit

@d-oit d-oit commented Jun 9, 2026

Owner

This change improves the quality of Markdown output from the Rust CLI, particularly for math-heavy (Wikipedia) and JavaScript-heavy sites. It ensures that raw XML/TeX from <math> and <svg> tags is skipped, and that LaTeX alt text is correctly formatted for LLM consumption. It also fixes a bug in the integration test suite's assertion helper.


PR created automatically by Jules for task 10941850418897376774 started by @d-oit

Updated strip_html in direct_fetch.rs to:
- Skip content in <math>, <svg>, and <noscript> tags to reduce noise.
- Wrap LaTeX found in <img> alt attributes (text starting with {\displaystyle) in $ delimiters.

Additional changes:
- Added unit tests to verify these improvements.
- Fixed a bug in integration tests where assertions were incorrectly wrapped in tuples.
- Formatted code with cargo fmt and ruff.

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>
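
The alt-text rule described in this commit message can be sketched as follows. This is a minimal illustration, not the code from direct_fetch.rs: the function name `format_img_alt` is hypothetical, and the only behavior taken from the PR is that alt text starting with `{\displaystyle` is wrapped in `$` delimiters.

```rust
/// Hypothetical helper (not the PR's actual code): wraps Wikipedia-style
/// TeX alt text in `$` delimiters and leaves ordinary alt text unchanged.
fn format_img_alt(alt: &str) -> String {
    let trimmed = alt.trim();
    if trimmed.starts_with("{\\displaystyle") {
        // `$` is a literal character outside of `{}` in format strings.
        format!("${}$", trimmed)
    } else {
        trimmed.to_string()
    }
}

fn main() {
    // prints: ${\displaystyle E=mc^{2}}$
    println!("{}", format_img_alt(r"{\displaystyle E=mc^{2}}"));
    // prints: A photo of a cat
    println!("{}", format_img_alt("A photo of a cat"));
}
```

Checking only the `{\displaystyle` prefix keeps the rule narrow, so ordinary image descriptions are never mistaken for math.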
@google-labs-jules

Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@vercel

vercel Bot commented Jun 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| do-web-doc-resolover | Ready | Preview, Comment | Jun 9, 2026 8:30am |

@deepsource-io

deepsource-io Bot commented Jun 9, 2026

DeepSource Code Review

We reviewed changes in b0fe7e8...7d7e098 on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.

See full review on DeepSource ↗

Important

Some issues found as part of this review are outside of the diff in this pull request and aren't shown in the inline review comments due to GitHub's API limitations. You can see those issues on the DeepSource dashboard.

PR Report Card (grades not captured): Overall Grade, Security, Reliability, Complexity, Hygiene

Code Review Summary

| Analyzer | Status | Updated (UTC) | Details |
| --- | --- | --- | --- |
| JavaScript | | Jun 9, 2026 8:29 a.m. | Review ↗ |
| Python | | Jun 9, 2026 8:29 a.m. | Review ↗ |
| Rust | | Jun 9, 2026 8:29 a.m. | Review ↗ |
| Shell | | Jun 9, 2026 8:29 a.m. | Review ↗ |

Important

AI Review is run only on demand for your team. We're only showing results of static analysis review right now. To trigger AI Review, comment @deepsourcebot review on this thread.

@codacy-production

codacy-production Bot commented Jun 9, 2026

Contributor

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 2 duplication

Metric Results
Duplication 2

View in Codacy


This commit improves the HTML-to-Markdown conversion in the Rust CLI and
fixes a critical bug in the integration test suite.

Changes:
- Updated `strip_html` in `direct_fetch.rs` to skip noise from `<math>`,
  `<svg>`, and `<noscript>` tags.
- Improved LaTeX detection in `img` alt attributes, wrapping detected
  TeX in `$` delimiters for LLM-readability.
- Added unit tests in Rust for the new parsing logic.
- Fixed `test_cli_markdown.py` where assertions were incorrectly written as
  tuples (e.g., `assert (cond, "msg")`), which always evaluate as True
  because a non-empty tuple is truthy.
- Applied `black` formatting to satisfy CI linting requirements.

Verified with `cargo test` and `pytest -m integration`.

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>
@d-oit

d-oit commented Jun 9, 2026

Owner Author

@jules fn strip_html has a cyclomatic complexity of 39 with "very-high" risk
Minor · Anti Pattern · RS-R1000
cli/src/providers/direct_fetch.rs

    }

    /// Strip HTML tags and convert to plain text with basic formatting
    fn strip_html(html: &str) -> String {
        let mut result = String::new();
        let mut in_tag = false;
        let mut current_tag = String::new();

@google-labs-jules

Contributor

Good catch! I've refactored strip_html by extracting the logic into a StripperState struct with focused helper methods (handle_tag, handle_opening_tag, handle_closing_tag). This significantly reduces the cyclomatic complexity and improves readability while maintaining the same functionality verified by tests.
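
A rough sketch of what such a decomposition might look like is given below. Only the names `StripperState`, `handle_tag`, `handle_opening_tag`, `handle_closing_tag`, `result`, and `skip_content_depth` appear in this PR; the method bodies, `push_text`, `tag_name`, and `is_noise_tag` are assumptions made for illustration, and the real struct has more fields (for example `in_pre` and `current_pre_lang`).

```rust
/// Simplified stand-in for the PR's `StripperState`; only two fields are modelled.
struct StripperState {
    result: String,
    skip_content_depth: usize,
}

/// Hypothetical helper: tags whose content is dropped entirely.
fn is_noise_tag(name: &str) -> bool {
    matches!(name, "math" | "svg" | "noscript")
}

/// Hypothetical helper: lowercased tag name without attributes.
fn tag_name(tag: &str) -> String {
    tag.split_whitespace().next().unwrap_or("").to_ascii_lowercase()
}

impl StripperState {
    fn new() -> Self {
        Self { result: String::new(), skip_content_depth: 0 }
    }

    /// Dispatches on the raw text found between `<` and `>`.
    fn handle_tag(&mut self, tag: &str) {
        let tag = tag.trim();
        if let Some(rest) = tag.strip_prefix('/') {
            self.handle_closing_tag(&tag_name(rest));
        } else if !tag.ends_with('/') {
            // Self-closing tags have no content to skip.
            self.handle_opening_tag(&tag_name(tag));
        }
    }

    fn handle_opening_tag(&mut self, name: &str) {
        if is_noise_tag(name) {
            self.skip_content_depth += 1;
        }
    }

    fn handle_closing_tag(&mut self, name: &str) {
        if is_noise_tag(name) {
            self.skip_content_depth = self.skip_content_depth.saturating_sub(1);
        }
    }

    /// Keeps text only when no noise tag is currently open.
    fn push_text(&mut self, c: char) {
        if self.skip_content_depth == 0 {
            self.result.push(c);
        }
    }
}

fn strip_html(html: &str) -> String {
    let mut state = StripperState::new();
    let mut current_tag = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match (in_tag, c) {
            (false, '<') => {
                in_tag = true;
                current_tag.clear();
            }
            (true, '>') => {
                in_tag = false;
                state.handle_tag(&current_tag);
            }
            (true, _) => current_tag.push(c),
            (false, _) => state.push_text(c),
        }
    }
    state.result
}

fn main() {
    // prints: ab
    println!("{}", strip_html("a<math><mi>x</mi></math>b"));
}
```

Each branch of the original function becomes a small method, so the main loop reduces to a four-arm match. A depth counter rather than a boolean lets nested noise tags close in any order without re-enabling output early.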

- Refactored `strip_html` in `direct_fetch.rs` by extracting logic into
  `StripperState` and helper methods, reducing cyclomatic complexity
  (addressing DeepSource RS-R1000).
- Maintained improved LaTeX parsing and noise reduction (<math>, <svg>,
  <noscript> skipping).
- Verified functionality with existing unit and integration tests.
- This commit updates the existing PR branch.

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>
    .collect();

    Self {
        result: String::new(),

Empty call to `new()`


The new() function is used to initialise an object with specific data.
If no arguments are passed, the behaviour is identical to default().

    result: String::new(),
    skip_content_depth: 0,
    in_pre: false,
    current_pre_lang: String::new(),

Empty call to `new()`


The new() function is used to initialise an object with specific data.
If no arguments are passed, the behaviour is identical to default().
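
The equivalence the reviewer points to can be checked with a small sketch. The struct below, `StripperFields`, is a hypothetical reduction that keeps only the four fields visible in the snippets above; the real `StripperState` has additional fields, so deriving `Default` may not apply to it directly.

```rust
/// Hypothetical reduction of the reviewed struct to the four fields
/// visible in the review snippets.
#[derive(Debug, Default, PartialEq)]
struct StripperFields {
    result: String,
    skip_content_depth: usize,
    in_pre: bool,
    current_pre_lang: String,
}

fn main() {
    // For `String`, `new()` and `default()` both produce an empty string.
    assert_eq!(String::new(), String::default());

    // Spelling out every "empty" value matches the derived default.
    let explicit = StripperFields {
        result: String::new(),
        skip_content_depth: 0,
        in_pre: false,
        current_pre_lang: String::new(),
    };
    assert_eq!(explicit, StripperFields::default());
}
```

Whether to prefer `String::default()` or `String::new()` is a style question here; both produce the same value.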

- Addresses Clippy error `clippy::needless_lifetimes` by eliding the
  explicit `'a` lifetime in the `impl` block for `StripperState`.
- This ensures CI checks pass for the Rust CLI.
- Main loop and parsing logic remain unchanged and verified by tests.

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>
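
The lifetime change described in this commit message can be illustrated with a small sketch. The `Scanner` type and its fields are hypothetical, since the PR does not show which borrowed data `StripperState` holds; only the `impl<'a> ...<'a>` to `impl ...<'_>` rewrite is taken from the commit description.

```rust
/// Hypothetical borrowing type standing in for `StripperState<'a>`.
struct Scanner<'a> {
    input: &'a str,
    pos: usize,
}

// Before (the form the commit says Clippy rejected):
//     impl<'a> Scanner<'a> { ... }
//
// After: the named lifetime is never referenced inside the block,
// so it is replaced with the anonymous lifetime `'_`.
impl Scanner<'_> {
    /// Returns the next character and advances past it.
    fn next_char(&mut self) -> Option<char> {
        let c = self.input[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(c)
    }
}

fn main() {
    let mut s = Scanner { input: "é!", pos: 0 };
    assert_eq!(s.next_char(), Some('é'));
    assert_eq!(s.next_char(), Some('!'));
    assert_eq!(s.next_char(), None);
}
```

The struct definition itself still needs the named `'a`; only the `impl` header changes, which is why the commit can describe the parsing logic as untouched.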
- Fixed Clippy error `clippy::needless_lifetimes` in `StripperState`.
- Applied `cargo fmt` to satisfy Rust formatting requirements.
- Verified all tests and lints pass locally.
- This commit updates the existing PR branch to resolve CI failures.

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>
@d-oit d-oit merged commit 9394d31 into main Jun 9, 2026
38 of 39 checks passed
@d-oit d-oit deleted the fix/cli-latex-parsing-10941850418897376774 branch June 9, 2026 13:17