Skip to content

COMP: Regenerate ShapeLabelMap baselines at full double precision#6521

Open
hjmjohnson wants to merge 1 commit into
InsightSoftwareConsortium:mainfrom
hjmjohnson:bug-shapelabelmap-relative-tolerance
Open

COMP: Regenerate ShapeLabelMap baselines at full double precision#6521
hjmjohnson wants to merge 1 commit into
InsightSoftwareConsortium:mainfrom
hjmjohnson:bug-shapelabelmap-relative-tolerance

Conversation

@hjmjohnson

@hjmjohnson hjmjohnson commented Jun 25, 2026

Copy link
Copy Markdown
Member

Regenerates the ShapeLabelMapFixture "resulting value" baselines at full double precision and tightens most tolerances, instead of loosening them. Highest-volume flaky candidate tracked in #6518 (187 failures / 30 days, dominated by Windows MSVC).

Root cause

The "resulting value" baselines were stored at only 4–5 decimal places. For Perimeter ≈ 28.39 that rounding is ~5e-5, which skews the fixed 1e-4 absolute tolerance window asymmetrically: the true (macOS) value sits only ~5e-6 inside the lower edge, so an MSVC RelWithDebInfo build that differs by just ~5e-6 fails. The computation was always double precision — the stored baselines were the lossy part.

What changed
  • Regenerated the 26 computed baselines at full 17-digit double precision (captured locally at setprecision(17)).
  • Tightened to 1e-5 (10× tighter than the original 1e-4) for the rotation-invariant scalars (Perimeter, EquivalentSphericalPerimeter/Radius, Roundness) and the magnitude-invariant EquivalentEllipsoidDiameter — 22 asserts.
  • Kept 1e-4 on the 4 orientation-dependent positions (Centroid / OrientedBoundingBoxOrigin in the _Direction tests): they traverse the less-stable principal-axes path, and the full-precision baseline alone re-centers their window.
  • Single file; no helper, no relative-tolerance machinery. Independently-verified and exact (1e-99/1e-10) asserts untouched.

All 12 ShapeLabelMapFixture tests pass locally (macOS arm64); clang-format clean. CI (esp. Windows MSVC) confirms whether the tightened 1e-5 holds.

@github-actions github-actions Bot added type:Compiler Compiler support or related warnings type:Testing Ensure that the purpose of a class is met/the results on a wide set of test cases are correct area:Filtering Issues affecting the Filtering module labels Jun 25, 2026
@hjmjohnson hjmjohnson marked this pull request as ready for review June 25, 2026 20:03
@greptile-apps

greptile-apps Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR relaxes selected ShapeLabelMapFixture geometry assertions.

  • Adds a ResultTol helper with an absolute floor and relative term.
  • Applies it to scalar resulting-value checks for perimeter, spherical metrics, and roundness.
  • Applies it to selected vector resulting-value checks for computed geometry points.

Confidence Score: 5/5

The changed test code looks mergeable after tightening the loosened checks that scale with coordinates or large perimeter values.

  • The helper compiles for the current scalar and itk::Point<double, 3> call sites.
  • The changed assertions are limited to tests marked as computed resulting values.
  • The remaining issues are test-coverage gaps, not production runtime failures.

Modules/Filtering/LabelMap/test/itkShapeLabelMapFilterGTest.cxx

Important Files Changed

Filename Overview
Modules/Filtering/LabelMap/test/itkShapeLabelMapFilterGTest.cxx Adds relative tolerances for computed shape-label-map test baselines; scalar checks mostly match the documented intent, while some vector and large-perimeter checks may now mask useful regression signals.

Reviews (1): Last reviewed commit: "COMP: Use relative tolerance for ShapeLa..." | Re-trigger Greptile

itk::MakePoint(10.13655, 4.21035, -25.67227), labelObject->GetCentroid(), 1e-4); // resulting value
ITK_EXPECT_VECTOR_NEAR(itk::MakePoint(10.13655, 4.21035, -25.67227),
labelObject->GetCentroid(),
ResultTol(itk::MakePoint(10.13655, 4.21035, -25.67227))); // resulting value

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Coordinate Offset Inflates Tolerance

When the expected point is far from the origin, ResultTol(expectedPoint) scales the allowed centroid error by the absolute world coordinate instead of the object size or spacing. The -25.67227 component makes this assertion accept about 0.0257 of centroid drift, so a geometry regression that shifts the object by hundredths of a physical unit can pass only because this test case is located far from zero.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment thread Modules/Filtering/LabelMap/test/itkShapeLabelMapFilterGTest.cxx Outdated
@blowekamp

Copy link
Copy Markdown
Member

I think I wrote most of these tests. The computation should be done at double precision in the shape filter, 1e-3 is REALLY high tolerance.

Do you have a link to the failing tests?

In the linked issue I do see mention of matrix instability for principle axises in some cases.

@hjmjohnson hjmjohnson requested a review from dzenanz June 26, 2026 00:45
@hjmjohnson

Copy link
Copy Markdown
Member Author

@blowekamp Here's the failure data from the 30-day CDash sweep (2026-05-25 → 06-26). 187 failed rows, all ShapeLabelMapFixture, across 5 cases. The platform split is the interesting part:

Test case Total Windows macOS Linux
3D_T1x1x1 35 35 0 0
3D_T2x2x2_Spacing 35 35 0 0
3D_T3x2x1 35 35 0 0
3D_T2x2x2_Spacing_Direction 41 35 3 3
3D_T3x2x1_Direction 41 35 3 3

Two distinct failure modes:

  • The three non-Direction cases fail only on Windows (105/105, dominated by Windows11-VS22x64-RelWithDebInfo-Favorite-Remotes) — 0 on macOS/Linux. That points at MSVC optimized-build FP (contraction/reordering in RelWithDebInfo) on the resulting-value asserts (GetPerimeter, GetEquivalentSphericalPerimeter/Radius, GetRoundness), not a double-precision shortfall.
  • The two Direction cases fail cross-platform — Windows and macOS and Linux. These are the direction-dependent geometry (principal axes / oriented bounding box), i.e. the "matrix instability for principal axes" you flagged. This is the genuinely cross-platform divergence.

So your instinct looks right: a blanket 1e-3 relative is probably too loose, and the two groups want different treatment — the Windows-only group is plausibly an MSVC FP-model issue, while the Direction group is the real numeric one.

Links to specific failing instances (CDash)

Non-Direction (Windows):

Direction (cross-platform):

(The macOS/Linux Direction instances are from PR CI runs, e.g. deprecate-vnl-eispack-eigensystem; the Windows rows are mostly the ryzenator.kitware nightlies.)

I couldn't script the exact expected-vs-actual deltas out of the CDash test-output pages (SPA), but I can pull the specific per-assertion numbers from a chosen build, or reproduce locally, if that helps decide the right tolerance (and whether the Direction cases need a separate fix on the principal-axes path rather than a tolerance bump). Tracking in #6518.

@hjmjohnson

Copy link
Copy Markdown
Member Author

@blowekamp Follow-up with precise numbers. I instrumented the two Direction cases and dumped the computed values at full double precision on macOS arm64 (where both currently pass), against the in-test baselines:

3D_T2x2x2_Spacing_Direction (passes locally)

Property Baseline in test Computed (macOS, 17 digits) abs Δ % of 1e-4
Perimeter 28.3919 28.391854811919647 4.52e-5 45%
Centroid.y 4.21035 4.2103580941358141 8.09e-6 8%
EquivalentSphericalPerimeter 34.86751 34.867517413417708 7.41e-6 7%
EquivalentSphericalRadius 1.66573 1.6657337346779293 3.7e-6 4%
OrientedBoundingBoxOrigin.x 9.75747 9.7574735481346888 3.55e-6 4%
Roundness 1.22808 1.2280817031643669 1.7e-6 2%

(full vectors: Centroid [10.136550336022847, 4.2103580941358141, -25.672275551739141], OBBOrigin [9.7574735481346888, 5.3613400676410672, -28.034804130519035], EqEllipDiam [2.4814019635976918, 2.7295421599572824, 5.4590843199146732])

3D_T3x2x1_Direction (passes locally)

Property Baseline in test Computed (macOS, 17 digits) abs Δ % of 1e-4
Centroid.z -14.5249 -14.524925635405358 2.56e-5 26%
Roundness 1.09189 1.0918957468809676 5.75e-6 6%
OrientedBoundingBoxOrigin.x 6.02222 6.0222153120831488 4.69e-6 5%
Centroid.x 5.06906 5.0690644435107242 4.44e-6 4%
Perimeter 14.62414 14.624143852112988 3.85e-6 4%
EquivalentSphericalPerimeter 15.96804 15.96804047389762 4.74e-7 0%

(full vectors: Centroid [5.0690644435107242, -3.2528635005915247, -14.524925635405358], OBBOrigin [6.0222153120831488, -4.4769132554776352, -15.570490372427054])

Takeaway

The baselines are stored at 4–5 decimal places, and that rounding alone consumes up to ~45% of the 1e-4 absolute budget on a platform where the test passes. For Perimeter ≈ 28.39, a 5-significant-digit literal carries ~5e-5 inherent error before any cross-platform FP divergence enters. So 1e-4 absolute is partly absorbing baseline under-precision, and the Windows MSVC builds diverge just enough beyond that to cross it.

You're right that 1e-3 relative is too loose. The cleaner fix is to regenerate these baselines at full double precision (the 17-digit values above) and then use a tight relative tolerance (~1e-5) sized to the genuine cross-platform spread, rather than widening absolute tolerance to paper over the rounding. The numbers above are ready to paste in. Happy to push that as the alternative to #6521 if you prefer it.

The ShapeLabelMapFixture "resulting value" baselines were stored at 4-5
decimal places. For a value like Perimeter ~28.39 that rounding is ~5e-5,
skewing the fixed 1e-4 absolute window enough that MSVC RelWithDebInfo
builds flapped (top candidate in InsightSoftwareConsortium#6518).

Replace the rounded baselines with the full double-precision computed
values, and tighten to 1e-5 for the rotation-invariant scalars (perimeter,
equivalent-sphere perimeter/radius, roundness) and the magnitude-invariant
equivalent-ellipsoid diameter. The orientation-dependent centroid and
oriented-bounding-box origin keep 1e-4: they traverse the less-stable
principal-axes path, and the full-precision baseline re-centers their
tolerance window.
@hjmjohnson hjmjohnson force-pushed the bug-shapelabelmap-relative-tolerance branch from b63c56d to 2ed4092 Compare June 26, 2026 11:39
@hjmjohnson hjmjohnson changed the title COMP: Relative tolerance for ShapeLabelMap resulting-value asserts COMP: Regenerate ShapeLabelMap baselines at full double precision Jun 26, 2026
@hjmjohnson

Copy link
Copy Markdown
Member Author

@blowekamp FYI — I force-pushed a different approach onto this PR to test it on CI. Instead of loosening the tolerance, I regenerated the "resulting value" baselines at full double precision and tightened most of the tolerances.

I originally took the easy "increase tolerance" path because the long-standing acceptance of these small deviations was never really a concern — in practice everyone just re-ran the flaky test and moved on. But your point pushed me to check whether higher-precision baselines are the easier real win, and they look like it.

What was done
  • Regenerated the 26 computed baselines (Perimeter, EquivalentSpherical*, Roundness, EquivalentEllipsoidDiameter, and the _Direction Centroid/OrientedBoundingBoxOrigin) at full 17-digit double precision.
  • Tightened to 1e-5 the rotation-invariant scalars and the magnitude-invariant ellipsoid diameter (22 asserts).
  • Kept 1e-4 on the 4 orientation-dependent positions (Centroid/OBBOrigin in the _Direction tests) — they traverse the less-stable principal-axes path, and the full-precision baseline already re-centers their window.
  • Single file, no relative-tolerance helper.
Why this fixes the flap

The baselines were stored at 4–5 decimals. For Perimeter ≈ 28.39 that rounding is ~5e-5, which skewed the fixed 1e-4 absolute window asymmetrically — the macOS value sits only ~5e-6 inside the lower edge, so the MSVC RelWithDebInfo build only had to differ by ~5e-6 to fail. Full-precision baselines re-center the window and recover that headroom, so a tighter tolerance holds. All 12 ShapeLabelMapFixture tests pass locally (macOS arm64); watching CI to confirm 1e-5 holds on Windows MSVC.

@blowekamp

Copy link
Copy Markdown
Member

@hjmjohnson I am not sure how automated these messages are or how engaged you are with this. But the above corse of action does seem logical to me. If the assumption is that we can compute across platforms with 1e-5 accuracy than the above tolerances and precision of the base line is sufficient.

Second, If the results of the failing test are looked at in most cases this not a tolerance issue:
https://open.cdash.org/tests/2582565204

The ShapeLabelMapFixture.3D_T1x1x1's labelObject->GetPerimeter() has a difference of 0.996. That is clearly not a tolerance issue.

This particular test is JUST one pixel set on a 3D image! What is going on here? It needs further investigation. We should be able to do this well.

On this failing test:
https://open.cdash.org/tests/2606525315

 T:\Dashboard\ITK\Modules\Filtering\LabelMap\test\itkShapeLabelMapFilterGTest.cxx(311): error: The RMS difference between itk::MakeVector(2.0, 2.2, 4.4) and labelObject->GetOrientedBoundingBoxSize() is 3.637333435387074e-05,
  which exceeds 1e-10, where
itk::MakeVector(2.0, 2.2, 4.4) evaluates to [2, 2.2000000000000002, 4.4000000000000004],
labelObject->GetOrientedBoundingBoxSize() evaluates to [2.0000287606804887, 2.2000494488745819, 4.4000263948937581], and
1e-10 evaluates to 1e-10.

This does appear to be a tolerance issue. I saw on a couple other tests with the RMS of vector. This has a custom ITK_EXPECT_VECTOR_NEAR macro where the tolerance should likely be adjusted.

@blowekamp

blowekamp commented Jun 26, 2026

Copy link
Copy Markdown
Member

On the one system I check that this test is failing has ITK_USE_FLOAT_SPACE_PRECISION:BOOL=ON configured.

The failing vector comparison above with the bonding box not being with in the tolerance make sense with this configuration. And relaxing the vector near tolerance is appropriate.

I don't have a complete picture of the reasons for the other test failures.

Adding the full precision is a reasonable change, and an near important digit may have been incorrectly dropped in the current values.

EDIT: Also this system with floating point computation enables has 255 other test failures. So Our tests are not configured to pass when floating point precision is enabled ( nor do I think the tolerances should be relaxed to make them)

@dzenanz

dzenanz commented Jun 26, 2026

Copy link
Copy Markdown
Member

The main value of the build with ITK_USE_FLOAT_SPACE_PRECISION:BOOL=ON is to make sure that option compiles. Test tolerances are of lesser concern.

@blowekamp

Copy link
Copy Markdown
Member

The main value of the build with ITK_USE_FLOAT_SPACE_PRECISION:BOOL=ON is to make sure that option compiles. Test tolerances are of lesser concern.

@dzenanz I don't think have tests failing in an "Expected" section of the dashboard is expected. Perhaps the testing phase of this build could be skipped? Or moved to a non-expected section.

Also this particular test is not on the current build because there is a compilation error.

@dzenanz

dzenanz commented Jun 26, 2026

Copy link
Copy Markdown
Member

I moved this build to "Nightly" group.

@blowekamp blowekamp self-requested a review June 26, 2026 19:17
@hjmjohnson

Copy link
Copy Markdown
Member Author

@blowekamp FYI: I'm unlikely to be able to look more deeply into this until later this weekend. I do review all the code changes and read all the commit messages. I'll be honest here, was focused on making the tests not flaky, I don't think this is indicative of code bugs, just testing environment inconsistency boundary conditions. I was too lazy in trying to keep this test from continuing to infect the review of other PRs by it's failures and CDASH noise.

@blowekamp

blowekamp commented Jun 26, 2026

Copy link
Copy Markdown
Member

@blowekamp FYI: I'm unlikely to be able to look more deeply into this until later this weekend. I do review all the code changes and read all the commit messages. I'll be honest here, was focused on making the tests not flaky, I don't think this is indicative of code bugs, just testing environment inconsistency boundary conditions. I was too lazy in trying to keep this test from continuing to infect the review of other PRs by it's failures and CDASH noise.

Do you have an example CI failure where the FLOAT option was not enabled?

EDIT: Found a couple: https://open.cdash.org/tests/2615238403

They are specifically related to the oriented bounding box computation. The change also is related to eigen/eispack-eigensystem which is what is computation uses. I don't think these tests were "flacky" before the numerics changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Filtering Issues affecting the Filtering module type:Compiler Compiler support or related warnings type:Testing Ensure that the purpose of a class is met/the results on a wide set of test cases are correct

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants