fix(omml): correct LaTeX output for fractions, math operators, and functions #3122

Merged
PeterStaar-IBM merged 8 commits into docling-project:main from
giulio-leone:fix/omml-latex-conversion-bugs
Mar 25, 2026

Conversation

@giulio-leone
Contributor

Summary

Fixes three related correctness bugs in OMML-to-LaTeX conversion (docling/backend/docx/latex/omml.py and latex_dict.py):

Bug A — Fraction raised to a power: missing grouping braces

When <m:sSup> (superscript) has an <m:f> (fraction) as its base, the converter emitted:

\frac{(x-c)}{v}^{2}

which is ambiguous: nothing in the output makes it explicit that ^{2} applies to the whole fraction rather than only to the denominator group that immediately precedes it.

Fix: Added dedicated do_ssub, do_ssup, and do_ssubsup handler methods that detect complex base expressions (fractions, radicals) and wrap them in grouping braces:

{\frac{(x-c)}{v}}^{2}
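The grouping rule can be sketched as follows. This is only an illustration under stated assumptions, not the code in omml.py: `wrap_base`, `superscript`, and `COMPLEX_PREFIXES` are hypothetical names, and the real handlers presumably inspect OMML elements rather than already-rendered LaTeX strings.

```python
# Hypothetical sketch: wrap a "complex" base in braces before attaching a script.
COMPLEX_PREFIXES = ("\\frac", "\\sqrt")  # assumed markers for fractions and radicals


def wrap_base(base: str) -> str:
    """Return the base wrapped in braces if it starts with a complex construct."""
    if base.startswith(COMPLEX_PREFIXES):
        return "{" + base + "}"
    return base


def superscript(base: str, exponent: str) -> str:
    """Build base^{exponent}, grouping the base when needed."""
    return f"{wrap_base(base)}^{{{exponent}}}"


print(superscript("\\frac{(x-c)}{v}", "2"))  # {\frac{(x-c)}{v}}^{2}
print(superscript("x", "2"))                 # x^{2}
```

Simple bases such as a single variable are left untouched, so existing output for ordinary superscripts does not change.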

Bug B — EN DASH and CIRCUMFLEX escaped as text-mode macros

Characters U+2013 EN DASH and U+005E CIRCUMFLEX inside <m:r><m:t> math runs were converted to \text{\textendash} and \text{\textasciicircum}, producing invalid math-mode LaTeX.

Fix: Added a _MATH_CHAR_MAP that intercepts these characters before the pylatexenc text encoder and maps them to their math-mode equivalents (- and ^).
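A minimal sketch of the interception idea, assuming a plain character-to-string mapping. `MATH_CHAR_MAP` and `map_math_chars` are illustrative names; the actual shape of `_MATH_CHAR_MAP` and its call site in omml.py are not shown in this thread.

```python
# Hypothetical sketch: map math-run characters before any text-mode encoding.
MATH_CHAR_MAP = {
    "\u2013": "-",  # EN DASH -> math minus
    "\u005e": "^",  # CIRCUMFLEX ACCENT -> superscript operator (kept as-is)
}


def map_math_chars(text: str) -> str:
    """Replace characters that have a direct math-mode equivalent."""
    return text.translate(str.maketrans(MATH_CHAR_MAP))


print(map_math_chars("x \u2013 y^2"))  # x - y^2
```

Characters not present in the map pass through unchanged and would still be handled by the regular text encoder.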

Bug C — log (and other standard functions) not mapped to LaTeX commands

An <m:func> element with name log fell through to the text fallback, producing italic log instead of upright \log.

Fix: Added the missing standard math functions to the FUNC dict: log, ln, exp, det, gcd, deg, hom, ker, dim, arg, inf, sup, lim, Pr.
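The lookup-with-fallback behaviour can be illustrated like this. The dict below is an illustrative subset, and `function_name_to_latex` is a hypothetical helper; the real FUNC dict in latex_dict.py may store its values in a different shape. The warning text mirrors the log message reported later in this thread.

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative subset only; assumed shape: function name -> LaTeX command.
FUNC = {name: "\\" + name for name in ("sin", "cos", "log", "ln", "exp", "lim")}


def function_name_to_latex(name: str) -> str:
    """Return the upright LaTeX command for a known function, else the raw name."""
    try:
        return FUNC[name]
    except KeyError:
        logger.warning("Function not supported, will default to text: %s", name)
        return name


print(function_name_to_latex("log"))  # \log
print(function_name_to_latex("foo"))  # foo
```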

Closes #3120

@github-actions
Contributor

github-actions Bot commented Mar 13, 2026

DCO Check Passed

Thanks @giulio-leone, all your commits are properly signed off. 🎉

@mergify
Contributor

mergify Bot commented Mar 13, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
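As a quick local check, the title rule above can be tried with Python's `re` module. This is a sketch that assumes Mergify's `~=` operator behaves like a regular-expression search anchored by the leading `^`; the second title is a made-up counterexample.

```python
import re

# Pattern copied from the merge-protection rule above.
TITLE_RE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)

pr_title = "fix(omml): correct LaTeX output for fractions, math operators, and functions"

print(bool(TITLE_RE.search(pr_title)))         # True
print(bool(TITLE_RE.search("Fix OMML bugs")))  # False (no lowercase type, no colon)
```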

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@giulio-leone giulio-leone force-pushed the fix/omml-latex-conversion-bugs branch from 7780ca6 to d58d253 Compare March 13, 2026 05:31
@dolfim-ibm
Member

@giulio-leone let's add the example documents linked to the issue as tests

@codecov

codecov Bot commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 76.31579% with 9 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling/backend/docx/latex/omml.py 76.31% 9 Missing ⚠️

📢 Thoughts on this report? Let us know!

@giulio-leone
Contributor Author

@dolfim-ibm Done — added three minimal DOCX test documents, one per bug:

  • omml_frac_superscript.docx — Bug A: fraction as superscript base (validates braces grouping)
  • omml_text_escapes_in_math.docx — Bug B: en-dash and caret inside math runs (validates math-mode operators)
  • omml_func_log.docx — Bug C: log function recognition (validates \log output)

Each file has matching groundtruth (md, json, itxt).

@giulio-leone
Contributor Author

Pushed a follow-up fix for the CI regression. The earlier OMML change correctly grouped fraction bases, but it also double-wrapped nested <sup>/<sub> containers in existing equations (^{^{2}}, _{_{0}}). I switched that handling to unwrap only inside sSub/sSup/sSubSup, which restores the historical equations.docx output while keeping the fraction-grouping fix intact.
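One way to picture the double-wrapping problem and the unwrap step is the sketch below. It is an assumption-laden illustration, not the PR's code: `unwrap_script` and `ssup` are hypothetical helpers that operate on rendered strings, and escaped braces such as `\}` are not handled.

```python
def unwrap_script(rendered: str, op: str) -> str:
    """Strip one leading op{...} wrapper (e.g. '^{2}' -> '2') if it spans the whole string."""
    prefix = op + "{"
    if not (rendered.startswith(prefix) and rendered.endswith("}")):
        return rendered
    depth = 0
    for i, ch in enumerate(rendered[len(op):], start=len(op)):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Only unwrap when the first brace closes at the very end.
                if i == len(rendered) - 1:
                    return rendered[len(prefix):-1]
                return rendered
    return rendered


def ssup(base: str, sup_rendered: str) -> str:
    """Attach a superscript exactly once, even if the child was already wrapped."""
    return f"{base}^{{{unwrap_script(sup_rendered, '^')}}}"


print(ssup("x", "^{2}"))  # x^{2}
print(ssup("x", "2"))     # x^{2}
```

Restricting the unwrap to the script handlers means other callers that rely on the wrapped form keep seeing it.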

@M-Hassan-Raza
Contributor

Thanks for tackling these OMML cases. Bug A looks headed in the right direction, but I think bug B is still not fully fixed.

The caret path still seems to end up as x-y\^2 rather than a real superscript form. From the code, it looks like ^ gets mapped in process_unicode() and then escaped again later, so the operator is still wrong in the final LaTeX. The new omml_text_escapes_in_math snapshot also seems to encode that same output, so I don’t think the regression is actually closed yet.

I’d also add a fixture for the delimiter-wrapped log\left(x\right) shape from the issue. The current omml_func_log test seems to prove \log(x), which is useful, but it doesn’t quite cover the exact form reported in #3120.

PSA: I am new to this codebase so I could be wrong, in which case please feel free to discard this comment.

@giulio-leone giulio-leone force-pushed the fix/omml-latex-conversion-bugs branch from 887442c to 5c9e5d0 Compare March 15, 2026 16:18
@giulio-leone
Contributor Author

Hi @dolfim-ibm, @M-Hassan-Raza — thanks for the review! I'll add the example documents from the linked issue as tests and take another look at the caret/superscript handling in process_unicode() to fix bug B properly. Will push an update.

@giulio-leone
Contributor Author

Thanks @dolfim-ibm @M-Hassan-Raza for the detailed feedback!

I've now:

  1. Replaced all 3 test documents with the real Word files from issue #3120 (the ~37 KB documents from @smroels, not my minimal ~1.2 KB programmatic fixtures)
  2. Fixed Bug B (caret/superscript): escape_latex was re-escaping the ^ that process_unicode had correctly mapped from U+005E. Now do_r restores math-mapped characters after escaping.
    • Before: x - y\^2
    • After: x - y^2
  3. Regenerated all groundtruth files for the new test documents
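The re-escaping problem from item 2 and the restore step can be sketched as below. `escape_text` is a deliberately tiny, hypothetical stand-in for the real escape_latex, and `convert_run` is an illustrative name rather than the actual do_r implementation.

```python
def escape_text(text: str) -> str:
    """Hypothetical stand-in for a LaTeX escaper that escapes the caret."""
    return text.replace("^", "\\^")


def convert_run(text: str, math_chars: tuple = ("^",)) -> str:
    """Escape a math run, then restore characters that were intentionally mapped."""
    escaped = escape_text(text)
    for ch in math_chars:
        # Undo the escape only for characters that are valid math operators.
        escaped = escaped.replace("\\" + ch, ch)
    return escaped


print(escape_text("x - y^2"))  # x - y\^2  (before the fix)
print(convert_run("x - y^2"))  # x - y^2   (after the fix)
```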

Bugs A and C were already working correctly with the new test documents:

  • Bug A: ${\frac{(x-c)}{v}}^{2}$ ✅ (fraction properly grouped)
  • Bug C: \log(\left(x\right)) ✅ (\log recognized)

Ready for re-review!

@giulio-leone giulio-leone force-pushed the fix/omml-latex-conversion-bugs branch from f7db987 to c3c487e Compare March 15, 2026 20:44
@giulio-leone
Copy link
Copy Markdown
Contributor Author

Hi team! 👋 The mergify bot indicates this PR requires two reviewers for test updates. @M-Hassan-Raza has already provided valuable feedback which I've addressed. Could a second reviewer (@PeterStaar-IBM, @cau-git, or @ceberam) take a look when convenient? The DCO check is now passing and all groundtruth files have been regenerated. Thank you!

@cau-git
Member

cau-git commented Mar 17, 2026

@giulio-leone Thanks for these updates. Could you please check why this test is failing?

=========================== short test summary info ============================
FAILED tests/test_backend_msword.py::test_e2e_docx_conversions - AssertionError: export to indented-text failed on tests/data/groundtruth/docling_v2/omml_frac_superscript.docx
assert False
 +  where False = verify_export('item-0 at level 0: unspecified: group _root_\n  item-1 at level 1: section: group header-0\n    item-2 at level 2: section_header: Issue 1: Fraction as superscript base not grouped\n      item-3 at level 3: text: The equation below raises a frac ... }^2 or \\left(\\frac{(x-c)}{v}\\right)^2.\n      item-4 at level 3: formula: {\\frac{(x-c)}{v}}^{2}', ('tests/data/groundtruth/docling_v2/omml_frac_superscript.docx' + '.itxt'), generate=False)
 +    where 'tests/data/groundtruth/docling_v2/omml_frac_superscript.docx' = str(PosixPath('tests/data/groundtruth/docling_v2/omml_frac_superscript.docx'))
= 1 failed, 289 passed, 8 skipped, 2 xfailed, 50 warnings in 520.12s (0:08:40) =
Error: Process completed with exit code 1.

@giulio-leone giulio-leone force-pushed the fix/omml-latex-conversion-bugs branch from c3c487e to 75dde05 Compare March 21, 2026 04:05
@giulio-leone
Contributor Author

Rebased this branch onto current main and fixed the remaining contributor-side E2E breakage.

Root cause: the OMML converter changes were still producing the expected output, but two groundtruth .itxt files had been regenerated in the wrong format (markdown-like prose instead of the repo's real indented-text export). That is why tests/test_backend_msword.py::test_e2e_docx_conversions could fail on omml_frac_superscript.docx.itxt even though the converter output itself was correct.

What I changed:

  • regenerated tests/data/groundtruth/docling_v2/omml_frac_superscript.docx.itxt
  • regenerated tests/data/groundtruth/docling_v2/omml_func_log.docx.itxt

Strict verification run from the rebased head (75dde05):

  • .venv/bin/pytest tests/test_backend_msword.py::test_e2e_docx_conversions -q
  • repeated the same command a second time with no code changes in between
  • both passes: 1 passed

Real DOCX branch-vs-main proof on the same three OMML fixtures:

  • omml_frac_superscript.docx
    • branch: formula: {\frac{(x-c)}{v}}^{2}
    • main: formula: \frac{(x-c)}{v}^{2}
  • omml_func_log.docx
    • branch: formula: y = \log(\left(x\right))
    • main: formula: y = log\left(x\right)
    • main also still logs Function not supported, will default to text: log
  • omml_text_escapes_in_math.docx
    • branch: formula: x - y^2
    • main: formula: x \text{ \textendash } y \text{ \textasciicircum } 2

So the branch still carries real converter fixes vs current main; the last failing E2E issue on this PR was the stale/wrong .itxt groundtruth, which is now aligned with the actual export format.

@PeterStaar-IBM
Member

@giulio-leone Please run this command a few times,

 uv run pre-commit run --all-files

giulio-leone and others added 6 commits March 23, 2026 11:27
…nctions

Fixes three related bugs in OMML-to-LaTeX conversion:

A) Fraction raised to a power now produces correct grouping braces:
   {\frac{(x-c)}{v}}^{2} instead of \frac{(x-c)}{v}^{2}
   Adds dedicated do_ssub/do_ssup/do_ssubsup handlers that wrap
   complex base expressions (fractions, radicals) in braces.

B) EN DASH (U+2013) and CIRCUMFLEX (U+005E) inside math runs are
   now mapped to their math-mode equivalents (- and ^) instead of
   being escaped as \text{\textendash} and \text{\textasciicircum}.

C) Adds missing standard math functions to the FUNC dict: log, ln,
   exp, det, gcd, deg, hom, ker, dim, arg, inf, sup, lim, Pr.
   These now emit proper LaTeX commands (e.g. \log) instead of
   falling back to plain italic text.

Closes #3120

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

Add three minimal DOCX files exercising the fixed edge cases:
- omml_frac_superscript.docx: fraction as superscript base (Bug A)
- omml_text_escapes_in_math.docx: en-dash and caret in math runs (Bug B)
- omml_func_log.docx: log function recognition (Bug C)

Each file includes matching groundtruth (md, json, itxt).

Requested-by: @dolfim-ibm
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

Bug B fix: prevent escape_latex from re-escaping characters that
process_unicode intentionally mapped to math operators.  The caret
character U+005E inside <m:r><m:t> math runs was being converted
to ^ by _MATH_CHAR_MAP, then immediately re-escaped to \^ by
escape_latex.  Now do_r restores math-mapped chars after escaping.

Result: x - y\^2 → x - y^2 (correct superscript)

Test documents: replace minimal programmatic fixtures (~1.2 KB)
with the real Word documents from issue #3120 reporter (smroels,
~37 KB each).  Regenerate all groundtruth.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

Update .itxt to use proper indented-text export format (item hierarchy)
and refresh .json to match current converter output.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

The OMML regression documents were exported into the .itxt fixtures using the
wrong format, so the real DOCX end-to-end check failed even though the rebased
converter output was correct.

Regenerate the two broken indented-text snapshots from the current branch so
the MS Word E2E test verifies the actual converter behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@giulio-leone giulio-leone force-pushed the fix/omml-latex-conversion-bugs branch from 2816c8a to 08001d9 Compare March 23, 2026 10:28
@giulio-leone
Contributor Author

PR refresh — 2026-03-23

Cherry-picked onto current main (f0e3d1d) — was previously on 4e650af. Branch fix/omml-latex-conversion-bugs force-pushed to fork.

Test validation (double-pass):

  • tests/test_backend_msword.py
  • Pass 1: 19 passed, 1 xfailed, 1 xpassed ✅
  • Pass 2: 19 passed, 1 xfailed, 1 xpassed ✅

All tests pass. PR is ready for review.

@giulio-leone
Contributor Author

✅ Validation Evidence

Branch: fix/omml-latex-conversion-bugs @ 08001d9
Status: 0 commits behind upstream main

Double-pass test results (strict identical runs, no code changes between passes):

Pass 1: 19 passed, 1 xfailed, 1 xpassed, 1 warning  ✅
Pass 2: 19 passed, 1 xfailed, 1 xpassed, 1 warning  ✅

Test file: tests/test_backend_msword.py

Branch pushed to fork. CI is the authoritative gate.

@dolfim-ibm
Member

@giulio-leone please apply the DCO fix commit, then we can finalize the PR.

giulio-leone and others added 2 commits March 23, 2026 17:06
Normalize the multiline condition in omml.py to match the repository
ruff-format output so the pre-commit gate stays clean on the refreshed
PR head.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 08001d9

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
@giulio-leone
Contributor Author

Addressed the remaining contributor-side blockers on refreshed head c7a5f2f.

What changed

  • applied the exact ruff-format normalization that the failing code-checks / lint (3.12) job wanted in docling/backend/docx/latex/omml.py
  • added the required DCO remediation commit for 08001d9c5ce1e4c12e31031529b15454d664f85e

Double-pass local gate

Ran these twice back-to-back with no code changes between passes:

  • uv run pre-commit run --all-files
  • uv run pytest tests/test_backend_msword.py -q

Both passes were clean:

  • 19 passed, 1 xfailed, 1 xpassed, 1 warning

Real DOCX proof on the issue fixtures

Using the same three real DOCX fixtures from this branch and loading each with both current main code and the refreshed branch code:

  • omml_frac_superscript.docx
    • branch: {\frac{(x-c)}{v}}^{2}
    • main: \frac{(x-c)}{v}^{2}
  • omml_text_escapes_in_math.docx
    • branch: x - y^2
    • main: x \text{ \textendash } y \text{ \textasciicircum } 2
  • omml_func_log.docx
    • branch: y = \log(\left(x\right))
    • main: y = log\left(x\right)

So the refreshed head keeps the intended converter fixes vs current main, and the remaining PR blockers from lint + DCO are now addressed.

@PeterStaar-IBM PeterStaar-IBM self-requested a review March 23, 2026 16:50
Member

@ceberam ceberam left a comment

Thanks @giulio-leone for fixing all those details!
Approving the PR to help move this important fix forward.
I just added a question that could be resolved in another PR, if needed.
Also, a comment regarding the tests for the future:

  • it is very helpful that you cover the changes with tests
  • however it would be good to group the same type of test data in a single docx document instead of creating small docs for each edge case. It is easier to maintain and faster to run with our CI/CD checks. We already have equations.docx, where we used to add edge cases.
  • try to add the backend type as a prefix in new test files (in this case, docx_), since the ground truth files all go in the same directory and this way they get grouped by backend. We did not do this at the beginning, but we have recently been trying to stick to this pattern.
  • try to avoid statements in the test files that can create confusion in the long term (even if it is dummy text for testing). If you write:
    The equation below uses U+2013 EN DASH as minus and U+005E as caret.
    Expected LaTeX: x - y^2
    Docling produces: x \text{ \textendash } y\text{ \textasciicircum }2_
    
    a user/contributor that reads this test file in the future may get confused, since it is no longer what Docling produces.

Comment thread docling/backend/docx/latex/omml.py
@giulio-leone
Copy link
Copy Markdown
Contributor Author

On the broader wrapping question from the latest review: I’m treating this PR as final for the three concrete regressions from #3120.

I have not broadened the change to operators like \sum / \int because I do not yet have a failing DOCX fixture showing those shapes need the same grouping rule, and I’d rather avoid widening the OMML conversion surface speculatively. If we get a reproducible example for those operators, I’m happy to handle that in a focused follow-up PR.

@PeterStaar-IBM PeterStaar-IBM merged commit e36125b into docling-project:main Mar 25, 2026
25 checks passed

Development

Successfully merging this pull request may close these issues.

OMML-to-LaTeX conversion produces incorrect output for fractions, math operators, and functions

7 participants