fix(omml): correct LaTeX output for fractions, math operators, and functions#3122
Conversation
|
✅ DCO Check Passed Thanks @giulio-leone, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updatesWonderful, this rule succeeded.When test data is updated, we require two reviewers
|
7780ca6 to
d58d253
Compare
|
@giulio-leone let's add the example documents linked to the issue as tests |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
@dolfim-ibm Done — added three minimal DOCX test documents, one per bug:
Each file has matching groundtruth (md, json, itxt). |
|
Pushed a follow-up fix for the CI regression. The earlier OMML change correctly grouped fraction bases, but it also double-wrapped nested |
|
Thanks for tackling these OMML cases. Bug A looks headed in the right direction, but I think bug B is still not fully fixed. The caret path still seems to end up as I’d also add a fixture for the delimiter-wrapped PSA: I am new to this coedbase so I could be wrong, in which case please feel free to discard this comment. |
887442c to
5c9e5d0
Compare
|
Hi @dolfim-ibm, @M-Hassan-Raza — thanks for the review! I'll add the example documents from the linked issue as tests and take another look at the caret/superscript handling in |
|
Thanks @dolfim-ibm @M-Hassan-Raza for the detailed feedback! I've now:
Bugs A and C were already working correctly with the new test documents:
Ready for re-review! |
f7db987 to
c3c487e
Compare
|
Hi team! 👋 The mergify bot indicates this PR requires two reviewers for test updates. @M-Hassan-Raza has already provided valuable feedback which I've addressed. Could a second reviewer (@PeterStaar-IBM, @cau-git, or @ceberam) take a look when convenient? The DCO check is now passing and all groundtruth files have been regenerated. Thank you! |
|
@giulio-leone Thanks for these updates. Could you please check why this test is failing? |
c3c487e to
75dde05
Compare
|
Rebased this branch onto current Root cause: the OMML converter changes were still producing the expected output, but two groundtruth What I changed:
Strict verification run from the rebased head (
Real DOCX branch-vs-main proof on the same three OMML fixtures:
So the branch still carries real converter fixes vs current |
|
@giulio-leone Please run this command a few times, |
…nctions
Fixes three related bugs in OMML-to-LaTeX conversion:
A) Fraction raised to a power now produces correct grouping braces:
{\frac{(x-c)}{v}}^{2} instead of \frac{(x-c)}{v}^{2}
Adds dedicated do_ssub/do_ssup/do_ssubsup handlers that wrap
complex base expressions (fractions, radicals) in braces.
B) EN DASH (U+2013) and CIRCUMFLEX (U+005E) inside math runs are
now mapped to their math-mode equivalents (- and ^) instead of
being escaped as \text{\textendash} and \text{\textasciicircum}.
C) Adds missing standard math functions to the FUNC dict: log, ln,
exp, det, gcd, deg, hom, ker, dim, arg, inf, sup, lim, Pr.
These now emit proper LaTeX commands (e.g. \log) instead of
falling back to plain italic text.
Closes #3120
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Add three minimal DOCX files exercising the fixed edge cases: - omml_frac_superscript.docx: fraction as superscript base (Bug A) - omml_text_escapes_in_math.docx: en-dash and caret in math runs (Bug B) - omml_func_log.docx: log function recognition (Bug C) Each file includes matching groundtruth (md, json, itxt). Requested-by: @dolfim-ibm Signed-off-by: Giulio Leone <giulioleone10@gmail.com> Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Bug B fix: prevent escape_latex from re-escaping characters that process_unicode intentionally mapped to math operators. The caret character U+005E inside <m:r><m:t> math runs was being converted to ^ by _MATH_CHAR_MAP, then immediately re-escaped to \^ by escape_latex. Now do_r restores math-mapped chars after escaping. Result: x - y\^2 → x - y^2 (correct superscript) Test documents: replace minimal programmatic fixtures (~1.2 KB) with the real Word documents from issue #3120 reporter (smroels, ~37 KB each). Regenerate all groundtruth. Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Update .itxt to use proper indented-text export format (item hierarchy) and refresh .json to match current converter output. Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
The OMML regression documents were exported into the .itxt fixtures using the wrong format, so the real DOCX end-to-end check failed even though the rebased converter output was correct. Regenerate the two broken indented-text snapshots from the current branch so the MS Word E2E test verifies the actual converter behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2816c8a to
08001d9
Compare
|
PR refresh — 2026-03-23 Cherry-picked onto current Test validation (double-pass):
All tests pass. PR is ready for review. |
✅ Validation EvidenceBranch: Double-pass test results (strict identical runs, no code changes between passes): Test file: Branch pushed to fork. CI is the authoritative gate. |
|
@giulio-leone please apply the DCO fix commit, then we can finalize the PR. |
Normalize the multiline condition in omml.py to match the repository ruff-format output so the pre-commit gate stays clean on the refreshed PR head. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 08001d9 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
|
Addressed the remaining contributor-side blockers on refreshed head What changed
Double-pass local gateRan these twice back-to-back with no code changes between passes:
Both passes were clean:
Real DOCX proof on the issue fixturesUsing the same three real DOCX fixtures from this branch and loading each with both current
So the refreshed head keeps the intended converter fixes vs current |
There was a problem hiding this comment.
Thanks @giulio-leone for fixing all those details!
Approving the PR to help move this important fix forward.
I just added a question that could be resolved in another PR, if needed.
Also, a comment regarding the tests for the future:
- it is very helpful that you cover the changes with tests
- however it would be good to group the same type of test data in a single
docxdocument instead of creating small docs for each edge case. It is easier to maintain and faster to run with our CI/CD checks. We already haveequations.docx, where we used to add edge cases. - try to add the backend type as a prefix in new test files (in this case,
docx_), since the ground truth files they all go in the same directory and in this way they get grouped by backend. We did not do it at the beginning but we try to stick to this pattern recently. - try to avoid statements in the test files that can create confusion in the long term (even if it is dummy text for testing). If you write:
a user/contributor that reads this test file in the future may get confused, since it is no longer what Docling produces.
The equation below uses U+2013 EN DASH as minus and U+005E as caret. Expected LaTeX: x - y^2 Docling produces: x \text{ \textendash } y\text{ \textasciicircum }2_
|
On the broader wrapping question from the latest review: I’m treating this PR as final for the three concrete regressions from I have not broadened the change to operators like |
Summary
Fixes three related correctness bugs in OMML-to-LaTeX conversion (
docling/backend/docx/latex/omml.pyandlatex_dict.py):Bug A — Fraction raised to a power: missing grouping braces
When
<m:sSup>(superscript) has an<m:f>(fraction) as its base, the converter emitted:\frac{(x-c)}{v}^{2}which is ambiguous — LaTeX applies
^2only to the closing brace of the denominator, not the whole fraction.Fix: Added dedicated
do_ssub,do_ssup, anddo_ssubsuphandler methods that detect complex base expressions (fractions, radicals) and wrap them in grouping braces:{\frac{(x-c)}{v}}^{2}Bug B — EN DASH and CIRCUMFLEX escaped as text-mode macros
Characters U+2013 EN DASH and U+005E CIRCUMFLEX inside
<m:r><m:t>math runs were converted to\text{\textendash}and\text{\textasciicircum}, producing invalid math-mode LaTeX.Fix: Added a
_MATH_CHAR_MAPthat intercepts these characters before thepylatexenctext encoder and maps them to their math-mode equivalents (-and^).Bug C —
log(and other standard functions) not mapped to LaTeX commandsAn
<m:func>element with namelogfell through to the text fallback, producing italicloginstead of upright\log.Fix: Added 15 missing standard math functions to the
FUNCdict:log,ln,exp,det,gcd,deg,hom,ker,dim,arg,inf,sup,lim,Pr.Closes #3120