COMP: Regenerate ShapeLabelMap baselines at full double precision #6521
hjmjohnson wants to merge 1 commit into
| Filename | Overview |
|---|---|
| Modules/Filtering/LabelMap/test/itkShapeLabelMapFilterGTest.cxx | Adds relative tolerances for computed shape-label-map test baselines; scalar checks mostly match the documented intent, while some vector and large-perimeter checks may now mask useful regression signals. |
Reviews (1): Last reviewed commit: "COMP: Use relative tolerance for ShapeLa..."
- itk::MakePoint(10.13655, 4.21035, -25.67227), labelObject->GetCentroid(), 1e-4); // resulting value
+ ITK_EXPECT_VECTOR_NEAR(itk::MakePoint(10.13655, 4.21035, -25.67227),
+                        labelObject->GetCentroid(),
+                        ResultTol(itk::MakePoint(10.13655, 4.21035, -25.67227))); // resulting value
Coordinate Offset Inflates Tolerance
When the expected point is far from the origin, ResultTol(expectedPoint) scales the allowed centroid error by the absolute world coordinate instead of the object size or spacing. The -25.67227 component makes this assertion accept about 0.0257 of centroid drift, so a geometry regression that shifts the object by hundredths of a physical unit can pass only because this test case is located far from zero.
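The scaling effect described above can be sketched in a few lines. `RelativeTol` is a hypothetical stand-in for `ResultTol`, whose definition is not shown in this thread; the `1e-3` relative factor is an assumption inferred from the ~0.0257 figure quoted in the comment.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Hypothetical stand-in for ResultTol (the real helper is not shown here):
// a relative factor times the largest absolute component of the expected point.
// The default factor of 1e-3 is an assumption, not taken from the PR diff.
inline double
RelativeTol(const std::array<double, 3> & expected, double relativeFactor = 1e-3)
{
  double maxAbs = 0.0;
  for (const double component : expected)
  {
    maxAbs = std::max(maxAbs, std::abs(component));
  }
  return relativeFactor * maxAbs;
}
```

Under these assumptions the centroid `(10.13655, 4.21035, -25.67227)` gets a tolerance of about 0.0257, while the same object translated close to the origin would get a far smaller one, even though its geometry is unchanged.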
|
I think I wrote most of these tests. The computation should be done at double precision in the shape filter; 1e-3 is a REALLY high tolerance. Do you have a link to the failing tests? In the linked issue I do see mention of matrix instability for the principal axes in some cases. |
|
@blowekamp Here's the failure data from the 30-day CDash sweep (2026-05-25 → 06-26). 187 failed rows, all
Two distinct failure modes:
So your instinct looks right: a blanket
Links to specific failing instances (CDash)
Non-
(The macOS/Linux
I couldn't script the exact expected-vs-actual deltas out of the CDash test-output pages (SPA), but I can pull the specific per-assertion numbers from a chosen build, or reproduce locally, if that helps decide the right tolerance (and whether the |
|
@blowekamp Follow-up with precise numbers. I instrumented the two
(full vectors: Centroid
(full vectors: Centroid
Takeaway
The baselines are stored at 4–5 decimal places, and that rounding alone consumes up to ~45% of the
You're right that |
The ShapeLabelMapFixture "resulting value" baselines were stored at 4-5 decimal places. For a value like Perimeter ~28.39 that rounding is ~5e-5, skewing the fixed 1e-4 absolute window enough that MSVC RelWithDebInfo builds flapped (top candidate in InsightSoftwareConsortium#6518). Replace the rounded baselines with the full double-precision computed values, and tighten to 1e-5 for the rotation-invariant scalars (perimeter, equivalent-sphere perimeter/radius, roundness) and the magnitude-invariant equivalent-ellipsoid diameter. The orientation-dependent centroid and oriented-bounding-box origin keep 1e-4: they traverse the less-stable principal-axes path, and the full-precision baseline re-centers their tolerance window.
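The window-skew argument in the commit message can be illustrated with a small sketch. All numeric values below are hypothetical and chosen only to make the effect visible; they are not taken from the CDash logs or from the test file.

```cpp
#include <cmath>

// True if `actual` lies inside the absolute window [baseline - tol, baseline + tol].
inline bool
WithinAbsTol(double baseline, double actual, double tol)
{
  return std::abs(actual - baseline) <= tol;
}

// Hypothetical values for illustration only.
constexpr double kFullPrecision = 28.390046; // value computed on the reference platform
constexpr double kRounded = 28.3900;         // the same value stored at 4 decimals
constexpr double kOtherPlatform = 28.390106; // another platform, 6e-5 away from the reference
constexpr double kTol = 1e-4;                // fixed absolute tolerance
```

With the rounded baseline, the other platform's value is about 1.06e-4 from the window center and fails, although it differs from the reference computation by only 6e-5. Against the full-precision baseline the same value passes with the same tolerance, because the window is centered on what was actually computed.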
hjmjohnson force-pushed from b63c56d to 2ed4092
|
@blowekamp FYI: I force-pushed a different approach onto this PR to test it on CI. Instead of loosening the tolerance, I regenerated the "resulting value" baselines at full

I originally took the easy "increase tolerance" path because the long-standing acceptance of these small deviations was never really a concern; in practice everyone just re-ran the flaky test and moved on. But your point pushed me to check whether higher-precision baselines are the easier real win, and they look like it.

What was done

Why this fixes the flap

The baselines were stored at 4–5 decimals. For |
|
@hjmjohnson I am not sure how automated these messages are or how engaged you are with this. But the above course of action does seem logical to me. If the assumption is that we can compute across platforms with 1e-5 accuracy, then the above tolerances and the precision of the baseline are sufficient.

Second, if you look at the results of the failing tests, in most cases this is not a tolerance issue: The ShapeLabelMapFixture.3D_T1x1x1's This particular test is JUST one pixel set on a 3D image! What is going on here? It needs further investigation. We should be able to do this well.

On this failing test: This does appear to be a tolerance issue. I saw this on a couple of other tests with the RMS of a vector. This has a custom ITK_EXPECT_VECTOR_NEAR macro where the tolerance should likely be adjusted. |
|
On the one system I checked where this test is failing has The failing vector comparison above, with the bounding box not being within the tolerance, makes sense with this configuration. And relaxing the vector-near tolerance is appropriate. I don't have a complete picture of the reasons for the other test failures. Adding the full precision is a reasonable change, and a near-important digit may have been incorrectly dropped in the current values.

EDIT: Also, this system with floating point computation enabled has 255 other test failures. So our tests are not configured to pass when floating point precision is enabled (nor do I think the tolerances should be relaxed to make them pass). |
|
The main value of the build with |
@dzenanz I don't think having tests failing in an "Expected" section of the dashboard is expected. Perhaps the testing phase of this build could be skipped? Or moved to a non-expected section. Also, this particular test is not on the current build because there is a compilation error. |
|
I moved this build to "Nightly" group. |
|
@blowekamp FYI: I'm unlikely to be able to look more deeply into this until later this weekend. I do review all the code changes and read all the commit messages. I'll be honest here: I was focused on making the tests not flaky. I don't think this is indicative of code bugs, just testing-environment inconsistencies at boundary conditions. I was too lazy in trying to keep this test from continuing to infect the review of other PRs with its failures and CDash noise. |
Do you have an example CI failure where the FLOAT option was not enabled? EDIT: Found a couple: https://open.cdash.org/tests/2615238403 They are specifically related to the oriented bounding box computation. The change is also related to eigen/eispack-eigensystem, which is what this computation uses. I don't think these tests were "flaky" before the numerics changes. |
Regenerates the `ShapeLabelMapFixture` "resulting value" baselines at full `double` precision and tightens most tolerances, instead of loosening them. Highest-volume flaky candidate tracked in #6518 (187 failures / 30 days, dominated by Windows MSVC).

Root cause

The "resulting value" baselines were stored at only 4–5 decimal places. For `Perimeter ≈ 28.39` that rounding is ~5e-5, which skews the fixed `1e-4` absolute tolerance window asymmetrically: the true (macOS) value sits only ~5e-6 inside the lower edge, so an MSVC RelWithDebInfo build that differs by just ~5e-6 fails. The computation was always `double` precision; the stored baselines were the lossy part.

What changed

- Baselines regenerated at full `double` precision (captured locally at `setprecision(17)`).
- Tolerance tightened to `1e-5` (10× tighter than the original `1e-4`) for the rotation-invariant scalars (`Perimeter`, `EquivalentSphericalPerimeter`/`Radius`, `Roundness`) and the magnitude-invariant `EquivalentEllipsoidDiameter`: 22 asserts.
- Tolerance kept at `1e-4` on the 4 orientation-dependent positions (`Centroid`/`OrientedBoundingBoxOrigin` in the `_Direction` tests): they traverse the less-stable principal-axes path, and the full-precision baseline alone re-centers their window.
- The `1e-99`/`1e-10` asserts are left untouched.

All 12 `ShapeLabelMapFixture` tests pass locally (macOS arm64); clang-format clean. CI (esp. Windows MSVC) confirms whether the tightened `1e-5` holds.
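As a sketch of why 17 significant digits matter when capturing a baseline: an IEEE-754 `double` printed with `max_digits10` (17) digits parses back to the identical value, while a 4-decimal rendering does not. The helper names `FormatBaseline` and `FormatRounded` are hypothetical and are not part of the test file.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Print a double with max_digits10 significant digits (17 for IEEE-754 double),
// which is enough for the text to parse back to the identical value.
inline std::string
FormatBaseline(double value)
{
  std::ostringstream os;
  os << std::setprecision(std::numeric_limits<double>::max_digits10) << value;
  return os.str();
}

// Print a double with a fixed number of decimals, like the old 4-5 decimal baselines.
inline std::string
FormatRounded(double value, int decimals)
{
  std::ostringstream os;
  os << std::fixed << std::setprecision(decimals) << value;
  return os.str();
}
```

A baseline pasted into the test from `FormatBaseline` output reproduces the locally computed value exactly, so the tolerance window is spent only on genuine cross-platform differences.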