Native theme fidelity suite + Material 3 fidelity fixes #5274
shai-almog wants to merge 9 commits into
Adds a data-driven fidelity test suite (scripts/fidelity-app) that renders each component under the native theme alongside the REAL native OS widget (off-screen rasterized) and measures per-component visual fidelity, gated by a one-way ratchet vs a committed baseline.

Android round raises overall Material 3 fidelity 94.9% -> 96.2% via real framework fixes (verified pixel-for-pixel against the native golden, no metric softening):
- FloatingActionButton: honor a fabDiameterMM theme constant for the Material 56dp fixed diameter instead of the icon*11/4 (~71dp) heuristic. FAB 85->98.
- Tabs.paintAnimatedIndicator: read tabsAnimatedIndicatorThicknessMm as a float (an int read dropped "0.45" -> 2x-too-thick indicator).
- Tabs.paintBottomDivider: new opt-in (tabsBottomDividerBool) full-width M3 divider painted directly (a border-bottom does not paint on the custom tab-row Container); colour from the TabsDivider UIID (light/dark aware).
- DefaultLookAndFeel: disabled-unchecked checkbox/radio box reads the *UncheckedColorUIID's own .disabled style, so the greyed box outline can differ from the darker disabled label text (Material renders them distinctly).

Theme (native-themes/android-material/theme.css) + recompiled shipped res.

Host tooling: ProcessScreenshots --mode fidelity, RenderFidelityReport, FidelityGate (ratchet), cn1ss.sh helpers, run-*-fidelity-tests.sh, and the scripts-fidelity GitHub workflow.

iOS round is blocked: rendering the native UIKit reference inside a ParparVM native method NPEs whenever it does real UIKit work (a trivial stub delivers; not a threading or marshaling fault). Documented in the iOS NativeWidgetFactory impl; needs a ParparVM fix or a PeerComponent+screenshot redesign.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
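The Tabs indicator bug above is an instance of a general pitfall: a fractional theme constant read through an integer accessor. The sketch below illustrates that bug class only. ThemeConstantSketch, readInt and readFloat are hypothetical stand-ins, not the Codename One API, and the fall-back-to-default behaviour (and the default of 1) is an assumption about how the int read "dropped" the value.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a theme-constant lookup; NOT the Codename One API. */
final class ThemeConstantSketch {
    private final Map<String, String> constants = new HashMap<>();

    void put(String key, String value) {
        constants.put(key, value);
    }

    /** Integer read. Assumption: an unparsable value silently yields the default. */
    int readInt(String key, int def) {
        String v = constants.get(key);
        if (v == null) {
            return def;
        }
        try {
            return Integer.parseInt(v.trim());
        } catch (NumberFormatException e) {
            return def; // "0.45" ends up here, so the configured value is lost
        }
    }

    /** Float read: keeps fractional millimetre values such as "0.45". */
    float readFloat(String key, float def) {
        String v = constants.get(key);
        if (v == null) {
            return def;
        }
        try {
            return Float.parseFloat(v.trim());
        } catch (NumberFormatException e) {
            return def;
        }
    }

    public static void main(String[] args) {
        ThemeConstantSketch theme = new ThemeConstantSketch();
        theme.put("tabsAnimatedIndicatorThicknessMm", "0.45");
        // The default of 1 is illustrative only.
        System.out.println("int read:   " + theme.readInt("tabsAnimatedIndicatorThicknessMm", 1));
        System.out.println("float read: " + theme.readFloat("tabsAnimatedIndicatorThicknessMm", 1f));
    }
}
```

With a default of 1 the integer read yields roughly twice the configured 0.45, which is consistent with the "2x-too-thick" symptom, though the real default may differ.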
JavaSE simulator screenshot updates
Compared 11 screenshots: 10 matched, 1 updated.
Native fidelity (Android, Material 3)
54 pairs compared -- median 93.8%, worst 75.4%
Android screenshot updates
Compared 136 screenshots: 104 matched, 32 updated.
- Switch.java: replace a non-ASCII U+2248 with ~ (the Android port's javac uses US-ASCII encoding and failed on it).
- scripts/javase/screenshots: refresh the 7 simulator goldens that shifted with the framework/theme changes (rendered on CI Linux to match the test env).
- scripts-fidelity.yml: TEMPORARY seed -- run the Android fidelity suite with FIDELITY_UPDATE_GOLDENS=1 + FIDELITY_UPDATE_BASELINE=1 so the native goldens and baseline are regenerated on CI's emulator density (the committed ones were rendered on a different local emulator, so 50/54 pairs "could not be compared"). Reverted in a follow-up once the CI-density artifacts are committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The native goldens + ratchet baseline are now the ones the seed run regenerated on CI's own emulator (e.g. Tabs 377x100 vs the local 1039x277), so the fidelity gate compares like-for-like instead of failing 50/54 pairs on size mismatch.

Removes the temporary FIDELITY_UPDATE_* seed so the job is a real one-way ratchet again. CI baseline overall fidelity: 96.2%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
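For readers unfamiliar with the term, a one-way ratchet gate can be sketched as follows. This is a minimal illustration, not the actual FidelityGate code: the class and method names are hypothetical, and the idea that the baseline is only ever raised on reseed is an assumption read into the word "ratchet".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a one-way ratchet; not the real FidelityGate. */
final class RatchetGateSketch {

    /** Names of pairs whose current score fell below the committed baseline. */
    static List<String> regressions(Map<String, Double> baseline, Map<String, Double> current) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Double> e : baseline.entrySet()) {
            Double now = current.get(e.getKey());
            if (now == null) {
                continue; // unscored pair: assumed to be reported elsewhere, not failed here
            }
            if (now < e.getValue()) {
                failed.add(e.getKey());
            }
        }
        return failed;
    }

    /** Assumed reseed rule: a pair's floor can rise but never drop. */
    static Map<String, Double> raise(Map<String, Double> baseline, Map<String, Double> current) {
        Map<String, Double> next = new LinkedHashMap<>(baseline);
        for (Map.Entry<String, Double> e : current.entrySet()) {
            next.merge(e.getKey(), e.getValue(), Math::max);
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Double> baseline = new LinkedHashMap<>();
        baseline.put("FloatingActionButton", 98.0); // illustrative scores
        baseline.put("Tabs", 90.0);
        Map<String, Double> current = new LinkedHashMap<>();
        current.put("FloatingActionButton", 97.0);
        current.put("Tabs", 95.0);
        System.out.println("regressed: " + regressions(baseline, current));
        System.out.println("next baseline: " + raise(baseline, current));
    }
}
```

The gate fails per pair rather than on an aggregate, so an improvement in one component cannot mask a regression in another.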
Compared 133 screenshots: 133 matched.
Compared 131 screenshots: 131 matched.
Compared 134 screenshots: 134 matched.
iOS fidelity native references now render (48 delivered, was 0).

The earlier "ParparVM can't render UIKit in a native method" conclusion was wrong: the cause was three mundane MRC (non-ARC) memory bugs in NativeWidgetFactoryImpl.m --
1. knownKind: cached an AUTORELEASED +[NSSet setWithObjects:] in a static, which dangled once the autorelease pool drained between native calls; the 2nd call dereferenced freed memory. ParparVM turns that EXC_BAD_ACCESS into a bogus Java NPE (which read as "buildAndRender NPEs"). Fixed: [[NSSet alloc] initWithObjects:] (+1).
2. The rendered NSData was autoreleased and built on the main queue (UIKit layout -- e.g. SF-Symbol buttons -- hangs off-main, so the build is dispatch_sync'd to main); when dispatch_sync returned, main's pool drained and freed it before the EDT's writeToFile. Fixed: -retain it across the boundary, -release after.
3. (UIKit build moved to the main thread to avoid the off-main layout hang.)

Report (RenderFidelityReport): lead with median / worst-pair / 25th-percentile / distribution buckets instead of a single misleading mean; add a per-pair percentage table (Fidelity, SSIM, mean-delta, delta-vs-baseline) sorted worst first; list unscored pairs explicitly; render the side-by-side cards for every pair worst-first.

Workflow: drop continue-on-error on the iOS job (no longer a blocker); reseed per-environment goldens (FIDELITY_UPDATE_GOLDENS) while the committed baseline remains the portable ratchet floor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
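To illustrate why the report leads with median, worst pair and 25th percentile rather than the mean, here is a small sketch. The class name and the linear-interpolation percentile are assumptions; RenderFidelityReport may compute its percentiles differently. The scores are made up.

```java
import java.util.Arrays;

/** Hypothetical sketch of the summary statistics; not the RenderFidelityReport code. */
final class FidelitySummarySketch {

    /** Percentile by linear interpolation between closest ranks (input must be sorted ascending). */
    static double percentile(double[] sortedAsc, double p) {
        if (sortedAsc.length == 0) {
            throw new IllegalArgumentException("no scores");
        }
        double rank = p / 100.0 * (sortedAsc.length - 1);
        int lo = (int) Math.floor(rank);
        int hi = (int) Math.ceil(rank);
        return sortedAsc[lo] + (rank - lo) * (sortedAsc[hi] - sortedAsc[lo]);
    }

    static double mean(double[] scores) {
        double sum = 0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.length;
    }

    public static void main(String[] args) {
        double[] scores = {99, 98, 97, 96, 20}; // made-up: four good pairs, one broken
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        System.out.println("mean   = " + mean(scores));
        System.out.println("median = " + percentile(sorted, 50));
        System.out.println("p25    = " + percentile(sorted, 25));
        System.out.println("worst  = " + sorted[0]);
    }
}
```

In this made-up set one broken pair drags the mean well below every healthy pair, while the median still describes the typical pair and the worst value points directly at the outlier.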
… app

The off-screen UIKit factory render was bunk: it rasterized DETACHED widgets at scale=1.0, so a 30pt button was 30px inside a 1087px tile (tiny, wrong size), and UINavigationBar/UITabBar rendered blank without a window. Replaced it for iOS with the approach Shai asked for:
- scripts/fidelity-app/ios-native-ref/NativeRef.swift: a standalone native iOS app that lays each reference UIKit widget out in a REAL UIWindow and captures it with drawHierarchy(afterScreenUpdates:) -- so nav/tab bars render correctly -- at CN1's pixel density (so the PNG overlays the CN1 render 1:1, no scaling). Built directly with swiftc (no Xcode project) by scripts/build-ios-native-ref.sh, which runs it on the simulator and copies the PNGs into the committed iOS goldens.
- run-ios-fidelity-tests.sh: iOS now compares the CN1 render against these COMMITTED goldens (generated offline, not same-run) instead of the broken factory native.
- ProcessScreenshots: tolerate a few px of cross-environment rounding (golden 1088 vs CN1 1087) by cropping both to their common top-left region before diffing -- a true 1:1 overlay, never a scale.

Result: all 50 iOS pairs now compare against real, correctly-sized native widgets (Toolbar was 0% blank -> a real centred-vs-left-aligned title diff). Seeded the iOS ratchet baseline (mean 62.3%); the low scores are the genuine untuned-iOSModern-theme gaps to drive up next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compared 135 screenshots: 135 matched.
The native and CN1 tiles both anchor the widget top-left, but their pixel sizes can diverge -- a few px of cross-environment rounding (iOS offline goldens), or a larger native-vs-CN1 tile-geometry gap that flakes between Android emulator runs (e.g. CN1 320 vs native 377). Failing those as "size_mismatch" broke the gate.

Now both are cropped to their common top-left region and overlaid 1:1 (never a scale); the structural metric still crops to each widget's content bbox, so an honest extent difference scores lower rather than erroring. Only a degenerate overlap (<8px) is an error.

TEMPORARY: FIDELITY_UPDATE_BASELINE=1 on both run steps to reseed the ratchet baselines on CI under the new comparison (reverted once the baselines are committed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
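The common-region crop described above could look roughly like this. It is a sketch, not the ProcessScreenshots code: the class and method names are hypothetical, and applying the 8px minimum to each dimension separately is an assumption.

```java
import java.awt.image.BufferedImage;

/** Hypothetical sketch of the top-left common-region crop; not the ProcessScreenshots code. */
final class CommonRegionSketch {
    /** Assumption: the "<8px" degenerate-overlap rule applies to width and height individually. */
    static final int MIN_OVERLAP_PX = 8;

    /** Crops both tiles to their shared top-left rectangle so pixels overlay 1:1 with no scaling. */
    static BufferedImage[] cropToCommon(BufferedImage nativeTile, BufferedImage cn1Tile) {
        int w = Math.min(nativeTile.getWidth(), cn1Tile.getWidth());
        int h = Math.min(nativeTile.getHeight(), cn1Tile.getHeight());
        if (w < MIN_OVERLAP_PX || h < MIN_OVERLAP_PX) {
            throw new IllegalArgumentException("degenerate overlap: " + w + "x" + h);
        }
        return new BufferedImage[] {
            nativeTile.getSubimage(0, 0, w, h),
            cn1Tile.getSubimage(0, 0, w, h)
        };
    }

    public static void main(String[] args) {
        // Sizes taken from the example above: native 377 wide, CN1 320 wide.
        BufferedImage nativeTile = new BufferedImage(377, 100, BufferedImage.TYPE_INT_ARGB);
        BufferedImage cn1Tile = new BufferedImage(320, 100, BufferedImage.TYPE_INT_ARGB);
        BufferedImage[] pair = cropToCommon(nativeTile, cn1Tile);
        System.out.println(pair[0].getWidth() + "x" + pair[0].getHeight());
    }
}
```

Because both tiles anchor the widget at the top-left, cropping from (0, 0) keeps the widget aligned in both images, whereas a scale would blur edges and shift stroke positions.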
The old score was the mean colour agreement over all widget-content pixels, so a
large flat region that happened to match -- e.g. a dark nav-bar fill against a
dark tile -- could carry the score into the high 80s even when the actual widget
(the title) was centred in one render and left-aligned at a totally different
font size in the other. "Mostly got points for being black."
Now fidelity = min(fillSim, structSim):
- fillSim = mean colour agreement over content pixels (the old term; catches
  wrong fill colours).
- structSim = the same agreement WEIGHTED BY local-gradient salience SQUARED, so
  flat fills count for ~nothing and the strongest edges -- glyph
  strokes, crisp outlines, separators -- dominate. A mis-placed or
  mis-sized title lands its strokes on the other render's flat fill,
  collapsing this term.
A widget must now agree in BOTH fill AND structure/placement. Effect on the iOS
Toolbar that triggered this: 89.3% -> ~59% (dark) / 36% (light), matching the
independent SSIM (~56%), while genuinely-similar widgets (an off switch, disabled
buttons) stay in the mid-80s. This is stricter for Android too; the CI seed run
reseeds both ratchet baselines under it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
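A toy version of the min(fillSim, structSim) score, to make the weighting concrete. This is a sketch under several assumptions, not the real metric: images are grayscale arrays in [0, 1], every pixel counts as content, salience is the larger of the two renders' central-difference gradient magnitudes, and all names are hypothetical.

```java
/** Hypothetical sketch of fidelity = min(fillSim, structSim); not the real metric code. */
final class FidelityScoreSketch {

    /** Central-difference gradient magnitude, clamped at the image border. */
    static double gradient(double[][] img, int x, int y) {
        int h = img.length;
        int w = img[0].length;
        double gx = img[y][Math.min(x + 1, w - 1)] - img[y][Math.max(x - 1, 0)];
        double gy = img[Math.min(y + 1, h - 1)][x] - img[Math.max(y - 1, 0)][x];
        return Math.sqrt(gx * gx + gy * gy);
    }

    /** Returns {fillSim, structSim, fidelity} for two same-sized grayscale images in [0, 1]. */
    static double[] score(double[][] a, double[][] b) {
        int h = a.length;
        int w = a[0].length;
        double fillSum = 0;
        double structSum = 0;
        double weightSum = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                double agree = 1.0 - Math.abs(a[y][x] - b[y][x]);
                // Assumption: an edge in EITHER render makes the pixel salient.
                double salience = Math.max(gradient(a, x, y), gradient(b, x, y));
                double weight = salience * salience; // squared, so flat fills weigh ~0
                fillSum += agree;
                structSum += weight * agree;
                weightSum += weight;
            }
        }
        double fillSim = fillSum / (w * h);
        // With no edges at all there is no structure to compare; fall back to the fill term.
        double structSim = weightSum > 0 ? structSum / weightSum : fillSim;
        return new double[] {fillSim, structSim, Math.min(fillSim, structSim)};
    }

    public static void main(String[] args) {
        // Dark 16x8 tiles with a 2px-wide bright stroke at different x positions.
        double[][] a = new double[8][16];
        double[][] b = new double[8][16];
        for (int y = 0; y < 8; y++) {
            a[y][3] = 1; a[y][4] = 1;
            b[y][11] = 1; b[y][12] = 1;
        }
        double[] s = score(a, b);
        System.out.println("fillSim=" + s[0] + " structSim=" + s[1] + " fidelity=" + s[2]);
    }
}
```

In this toy case the matching dark background keeps fillSim high while the misplaced stroke pulls structSim down, so the min reports the lower, structural figure: the same effect as the Toolbar title case described above.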
What

Adds a data-driven fidelity test suite (scripts/fidelity-app) that, for every component with a native equivalent, renders the real native OS widget (rasterized off-screen) alongside the CN1 component under the native theme, and measures a per-component similarity score. Routine CI renders only the CN1 side and diffs against committed native goldens; a one-way ratchet (FidelityGate) fails only when a change drops a pair below its baseline.

It then drives the Android Material 3 theme from 94.9% → 96.2% overall fidelity through real framework + theme fixes: every change verified pixel-for-pixel against the native golden, no metric softening.

Framework fixes (each fixes a real Material-fidelity bug)

- FloatingActionButton honors a fabDiameterMM constant (Material's fixed 56dp) instead of the legacy icon*11/4 (~71dp) heuristic
- Tabs.paintAnimatedIndicator reads tabsAnimatedIndicatorThicknessMm as a float (an int read silently dropped "0.45" → a 2×-too-thick indicator)
- Tabs.paintBottomDivider (opt-in tabsBottomDividerBool) paints the full-width M3 tab divider directly, because a CSS border-bottom does not paint on the custom tab-row Container; colour comes from the TabsDivider UIID (light/dark aware)
- DefaultLookAndFeel disabled-unchecked checkbox/radio box reads the *UncheckedColorUIID's own .disabled style, so the greyed box outline diverges from the (darker) disabled label text, as Material renders them

Plus the tuned native-themes/android-material/theme.css and recompiled shipped .res (Themes/, Ports, JS mirror).

Host tooling

ProcessScreenshots --mode fidelity, RenderFidelityReport, FidelityGate (ratchet), cn1ss.sh helpers, run-{android,ios}-fidelity-tests.sh, and the scripts-fidelity GitHub workflow.

Known limitation: iOS native references blocked

The iOS round cannot yet collect native UIKit references: rendering the native widget inside a ParparVM native method NPEs as soon as it does real UIKit work (a trivial stub delivers cleanly; reproduces identically with or without dispatch_sync, and String-arg/BOOL-return marshal fine, so it is neither a threading nor a marshaling fault). Documented in com_codenameone_fidelity_NativeWidgetFactoryImpl.m. Resolving it needs a ParparVM runtime fix, or rendering the native reference via a PeerComponent + Display.screenshot() instead of a NativeInterface method. The Android off-screen path (View.draw → Bitmap) works fully.

🤖 Generated with Claude Code