Ow native gpu 0.1.0 -> 0.9.9 by pgleeson · Pull Request #230 · openworm/sibernetic

pgleeson · 2026-05-21T09:20:12Z

No description provided.

…ine-validation Adjust energy test iterations and add reference logs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…e-validation

…for-engine-validation Limit energy test to 50 iterations

…for-engine-validation Allow shorter test runs

…prove-documentation Add repository guide and docstrings

…n-for-tests Ensure OpenCL tests run

…-solver Fix solver destructor segfault

…-solver-with-logs Add reference solver logs for tests

performance optimization: loops vs vectorization Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

More performance optimization by vectorizing Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…and-implement-tensor-operations Add PyTorch solver, unit tests, and installation scripts / instructions.

…og-file-issue Ensure OpenCL runtime and subprocess fallback

…with-pybind11 Enable optional PyTorch solver

add install for pytest

…d-for-pytorch-solver

…torch_backend Fix torch backend velocity output

…ll_bwd Adds param-gradient kernels and host accumulators so SGD over fluid params no longer requires finite differences (which were noisy in the chaotic snapshot-loss regime). New kernels: - spring_K_partial: per-particle scalar = <∂(spring_accel_i)/∂K, ga_i> - visc_K_partial: per-particle scalar = <∂(visc_accel_i)/∂visc_K, ga_i> Both read bGext (= dt · dL/dv_after_apply) and per-particle inputs (positions, velocities, density). Host sums per-particle outputs into total_grad_spring_K / total_grad_visc_K accumulators. xpbd_full_bwd now writes these as scalar files at optional argv[21], argv[22]. Validation (test_param_grads.py): toy 6-particle chain. - spring_K analytic vs FD: rel err 7.8e-4 ✓ PASS - visc_K analytic vs FD: rel err ~17% (FAIL the 10% threshold) The visc_K rel err is the **stop-gradient through density** missing chain — visc force scales as 1/ρ_i, and ρ_i depends indirectly on visc_pair_coef through positions across timesteps. Treating ρ as stop-gradient captures the dominant direct-dependence path; the implicit density-perturbation path adds the remaining ~17%. For SGD purposes, having gradient correct to ~80% still points in the right direction. Also adds: - sgd_tune.py — original FD-based SGD harness (useful baseline) - random_search_tune.py — random-search fallback for chaos basins - sgd_analytic.py — wrapper that uses production xpbd_step for fast forward eval, FD for params (analytic chain via xpbd_full_bwd is available but xpbd_full at K=50000 is slow due to dense N²) Empirical findings during exploration: - The cube-fall snapshot at t=1s is chaotic — small param changes can shift bounce phase, making finite-diff gradients noisy - Random search found best L=4.3e-3 (5.3% Δm_y err, 3.9% extent err) - SGD from baseline diverged at lr=0.05 (5% step crossed chaos cliff) - True path to <1% requires either: (a) smoother loss formulation (time-averaged or settled-state metrics) (b) running longer until cube settles past bouncing Both are research directions, not bug fixes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ed BPTT Three changes wiring up the analytic-gradient SGD pipeline: 1. Floor constraint (solve_floor_constraint_with_mask + backward) wired into xpbd_full_fwd/bwd. New optional [floor_y] arg matches xpbd_step's floor handling so gradient flow goes through the same physics. - per-step state extended by 1*n_active int32 (clamped mask) - forward saves mask, backward restores + dispatches floor_bw 2. predict_positions_backward bug fix: was missing sim_scale_inv factor on velocity gradient. Forward is `pos += v · dt · sim_scale_inv`; backward `dL/dv = grad_x_pred · dt · sim_scale_inv`. Old code did `dL/dv = grad_x_pred · dt` only — under-counted velocity gradient by 135135x when sim_scale=7.4e-6 (Sibernetic units). All 4 callers updated to pass sim_scale_inv buffer (toy ops pass 1.0). 3. Truncated BPTT support: env var BWD_TBPTT=N walks back only the last N steps in xpbd_full_bwd's reverse loop. Default = K (no truncation, current behavior). Empirically observed: gradient through chaotic xpbd_full backward amplifies ~3× per step. K=10 gives g_rho ~1e7, K=200 gives ~1e25, K=500 overflows fp32 to NaN. This is FUNDAMENTAL to backprop through chaotic dynamics — not a bug. Standard ML technique for chaotic systems is TBPTT (truncated backprop through time). Now available as `BWD_TBPTT=N` env var. Combine with gradient clipping in the SGD harness for stability. All existing regression tests still pass: - test_xpbd_full (∂L/∂ρ rel err 3e-5) - test_xpbd_full_visc (with pair_forces) - test_xpbd_full_spring (with springs) - test_param_grads (spring_K rel err 7.8e-4; visc_K 17% off due to stop-gradient through density) For SGD on the demo1 cube-fall: - Forward + backward cost: K=50000 → 60s + 480s = ~9 min/iter - TBPTT=N reduces backward to N/K of full cost - Gradient explosion still requires careful loss formulation or TBPTT + clipping; demo1's bouncing-cube snapshot loss is at the edge of what's tractable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Backprop through 50000 steps of chaotic xpbd dynamics overflows fp32 (gradient amplifies ~1000× per step in this regime). Add: - --tbptt N: env-var driven truncated BPTT window (xpbd_full_bwd walks only last N steps in reverse loop) - --clip-norm: clip global L2 norm of log-space gradient - NaN/Inf filter: replace non-finite components with 0 to avoid parameter blow-up if TBPTT N is still too large Empirical regime for demo1 cube fall: - TBPTT=3-5: gradients finite, clipped from ~1e8 (TBPTT=3) to ~1e18 (TBPTT=5). Direction preserved by clipping. - TBPTT=8+: NaN — fp32 overflow during backward. Tradeoff: TBPTT=5 only credits parameter influence on the LAST 5 steps of dynamics (sim_time = 100 µs of the 1.0 s run). Convergence will be slow because parameter→trajectory effects further back are invisible. Mitigates the chaos, doesn't escape it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

End-to-end SGD using xpbd_full_fwd + xpbd_full_bwd analytic chain (no FD, no random search). Captures: - 4-part substrate work (param grads, floor, predict_bw bug fix, TBPTT) - 12-step SGD trajectory: loss dropped 50× (0.147 → 0.0023) - Best simultaneous metrics: Δm 2.88%, ext 3.86% at step 10 - Best ext alone: 1.91% at step 11 - Pareto frontier observed; no single point under 1% on both - Honest discussion of chaos-driven gradient explosion + TBPTT remedy Substrate has correct analytic gradients now. Reaching sub-1% on both metrics simultaneously requires loss reformulation (settled-state, time-averaged), which is the natural follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three SGD runs total (12 + 15 + 10 iters): - Loss reduced 140× from start (0.147 → 0.0011) - Extent metric reached 0.01% (essentially perfect on this metric) - Δm settled around 3-3.3% — chaos prevents further reduction - Pareto trade-off persists; no point under 1% on BOTH simultaneously Surprise finding: SGD-optimal params for xpbd_full DON'T transfer to xpbd_step. Same algorithm, different fused-vs-unfused fp32 rounding, different bounce phase at t=1s under chaos. Documented honestly with three concrete next-step paths for cross-substrate transfer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The original demo1 puts the cube at box geometric center (44.42, 44.42, 44.42) but the boundary particles are not at perfect bounding-box center — the inner-face geometric center is at (43.59, 43.59, 43.59). The 0.83-unit offset creates an asymmetric pressure field that nudges the cube horizontally, so it slides across the floor as it falls instead of dropping straight down. This config shifts the cube (elastic + liquid) by (-0.83, 0, -0.83) so the cube starts at (43.59, 44.42, 43.59) — directly above the inner geometric center. With this fix the cube falls straight down without sliding, and visual comparisons across substrates show clean cube-fall behavior instead of chaotic horizontal drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

User caught visually that demo1's cube was sliding across the floor instead of falling straight down — a quirk of the boundary particles being asymmetric (cube at box geometric center is NOT at boundary inner geometric center; off by 0.83 units). Lateral pressure gradient nudges the cube horizontally, then chaos amplifies into "hockey-puck- on-ice" behavior that LOOKED clearly broken on visual inspection but was passing all our existing numerical checks. Three things in this commit: 1. configuration/demo1_centered — cube shifted by (-0.83, 0, -0.83) so it starts at boundary geo center (43.59, 44.42, 43.59). With the cube symmetrically placed, it falls straight down under gravity. 2. tests/test_cube_fall_trajectory.py — codifies the "looks correct" visual check as automated trajectory-equivalence assertions: - Initial cube position matches expected (within 1 unit) - Horizontal drift over the trajectory < 5 units (no slide) - Final cube min_y near floor (lands) - Y-extent retention in [0.5, 2.0] (no pancake, no explode) - Cube center y stable in last 0.5s (settled, not still falling) Runs as `python3 tests/test_cube_fall_trajectory.py path/to/position_buffer.txt`. Reads either Sibernetic's variable-frame format (boundary only in frame 0) or our normalized fixed-frame format. Same pass/fail criteria for both — a substrate is equivalent to the gold standard if and only if the trajectory check passes. Validated against three datasets: - Metal substrate, no springs, no pair_forces (centered cfg) → PASS (drift 1.34, extent retention 0.96) - Metal substrate, no springs, WITH pair_forces (centered cfg) → FAIL (drift 52.7, caught the slide that visual inspection found) - Sibernetic local Taichi-Metal (centered cfg) → FAIL (extent retention 0.00, caught the known Taichi pancake bug) 3. scripts/render_cube_fall_compare.py — side-by-side MP4 renderer. Reads two position_buffer.txt files, samples N matched sim times, renders 3D scatter plots with fixed box-bounds camera so visual comparison shows the actual cube trajectory (not autocamera-zoomed onto whatever's left). `python3 scripts/render_cube_fall_compare.py out.mp4 left.txt right.txt "Left Label" "Right Label"` The comparison movie at ~/Downloads/cube_drop_compare.mp4 shows Sibernetic Taichi (pancakes) vs Metal Native (intact) side by side. Both are running on demo1_centered so neither slides — only their intact-vs-pancake behavior differs. Note on cross-substrate scope: Sibernetic OpenCL gold standard requires Cloud Run; the baked image doesn't have demo1_centered yet (would need a redeploy with the new config baked in). For now this comparison uses Sibernetic local Taichi-Metal which has the known elastic-coordinate- scale bug. Once Cloud Run is redeployed with demo1_centered, swap the LEFT panel for OpenCL output for the gold-standard reference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Course-correction: the right regression test for substrate equivalence is "Metal-on-demo1 vs OpenCL-on-demo1 within tolerance," NOT "cube falls straight under idealized physics." demo1's boundary is asymmetric (3459 -x face vs 3671 +x face); both backends slide as a result. They should slide the *same way* — that's equivalence. Removes: configuration/demo1_centered (engineered around demo1's quirks) tests/test_cube_fall_trajectory.py (asserted idealized physics) Adds: tests/test_demo1_backend_parity.py — diffs two trajectories from same config: per-particle L2, cube-center diff, cube-extent diff, init + final agreement. Reads optional .times.txt sidecar for variable chunk sizes. scripts/render_demo1_parity.py — pyvista shaded-sphere renderer with quad time-warped sampling so the 8ms physical descent fills 30+ video frames (no teleport). src/metal_diff/dump_metal_trajectory.py — variable-chunk schedule (--dense-steps + --dense-chunk for descent, --chunk for settle) and per-frame .times.txt sidecar. First parity run on demo1 dense (logstep=20): mean per-particle L2=53, max cube-center diff=96 — substrates have NOT converged. That's the honest baseline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Default was 2.5e-13 (toy-test convention), but Sibernetic configs use 1000 kg/m³ (water density, per inc/owPhysicsConstant.h: const float rho0 = 1000.0f). With wrong rho_rest the density-constraint solver massively overcorrects: cube fell 39 sim units in 2ms (vs OpenCL's 7ms), then bounced violently to the top of the box and oscillated. With rho_rest=1000 (correct value): - cube falls under gravity at the right rate (8.4ms to floor) - lands with normal squash (extent_y 10 → 4.8) - settles at cy=1.47, drifts only 0.135 sim units over 25ms Parity test on dense data: before: mean per-particle L2 = 53.8, cube-center diff = 95.7 after: mean per-particle L2 = 6.2, cube-center diff = 10.4 Still fails strict thresholds (Metal compresses more, lands ~1ms early) but the substrate is now in the right ballpark. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lags CLI overrides for trajectory targets (so we can match OpenCL at 25ms instead of the hardcoded long-trajectory targets) and per-parameter freeze flags so we can tune just spring_K without touching rho or v. Tuning result for demo1 25ms cube fall (OpenCL parity): K=2220 (Sibernetic default): ext_y=4.79 at landing (over-compressed) K=5500 (best): ext_y=9.46 at landing K≥9500: NaN divergence (XPBD constraint solver unstable) Beyond K=5500 the cube extent saturates at ~9.46 — substrate-level limit. Remaining gap to OpenCL's ext_y≈11.2 is XPBD vs PCISPH on-impact behavior; not closable by spring stiffness alone. Parity test on demo1 0-25ms with K=5500: before tune (K=2220): center_diff=10.4, ext_diff=7.6 after tune (K=5500): center_diff= 7.1, ext_diff=5.1 4/5 parity checks now pass; only ext_diff above threshold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When inspecting short timescales (e.g. 0-25ms cube descent) time-warped sampling creates the illusion of "skating" because samples jump from e.g. 12ms to 86ms to 759ms. Uniform mode samples evenly across [0,t-max] so motion appears at correct speed. Also --hide-liquid filters type-1 liquid particles from the render — demo1 is elastic shell + 125 liquid particles inside; when liquid leaks out at impact, it visually obscures the cube. Use: --uniform # disable quad time-warping --t-max 0.025 # show only first 25ms --hide-liquid # only render the elastic cube particles Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pping Two related fixes that together make xpbd_full_bwd return stable finite gradients on demo1, where it previously produced NaN within 5 chain steps regardless of TBPTT setting. Fix 1 (kernel): density_constraint_grad_backward divided by `r` in g_hat = dir/r and h_minus_r/r factors with no epsilon. Particles in contact (r → 0 post-impact) produced inf/NaN. Replaced with r_safe = max(r, R_MIN) where R_MIN = 1e-4 sim units, and skip pairs with r2 < R_MIN². Other backward kernels already had similar guards. Fix 2 (host): xpbd_full_bwd chain rule amplifies position gradients by ~sim_scale_inv (1.35e5) per step because update_vel backward divides by dt and predict backward multiplies by dt·sim_scale_inv. With Sibernetic's mass=2e-12, gradients overflow to NaN within 5–10 chain steps — fundamental property of differentiable chaos with microscale parameters, not a bug. Added BWD_CLIP_NORM env var: per chain step, zero NaN/Inf and L2-clip running grad_x and grad_v to the cap. Yields biased-but-bounded gradients usable by Adam in log-space. Verification (TBPTT scan on demo1 1250 steps): Before: TBPTT=1: g_K=-3e-9 g_v=NaN gx0_NaN=414/1029 TBPTT=5: g_K=NaN g_v=NaN gx0_NaN=1029/1029 After (BWD_CLIP_NORM=1e3): TBPTT=1: g_K=-3e-9 g_v=+2.4e-4 gx0_NaN=0/1029 TBPTT=100: g_K=-1.4e3 g_v=-1.5e8 gx0_NaN=0/1029 TBPTT=1250: g_K=-5.1e4 g_v=+9.7e10 gx0_NaN=0/1029 SGD on K + v with the fixed backward (TBPTT=100, lr=0.05, BWD_CLIP_NORM=1e3): step 0: L=0.0354 K=5500 step 9: L=0.0339 K=8626 v=5.12e-5 (best, 4% loss reduction) step 11: K crossed stability boundary (≥9500 → forward NaN) SGD confirms manual sweep finding: optimum sits on the K-saturation plateau (5500-8625), right at the stability cliff. Remaining 16% extent gap to OpenCL is algorithmic (XPBD vs PCISPH on impact), not closable by spring tuning alone. Also adds src/metal_diff/sgd_fd.py — finite-difference SGD as a fallback for parameters not exposed in xpbd_full_bwd analytic gradients. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extends solve_density_constraint_backward to also compute ∂L/∂A where A = alpha_inv_dt2 (= alpha_dens / dt²). Math: with Δλ = -(C + A·λ_pre)/D, the derivative simplifies to ∂Δλ/∂A = -(λ_pre + Δλ) / D = -λ_post / D so per-particle gradient = chain · (-λ_post/D), where chain = (ω·g)/m + ψ. Host sums per-particle contributions, divides by dt², writes to argv[24] of xpbd_full_bwd. sgd_true.py now learns alpha_dens via Adam alongside rho/K/v with --alpha-init / --freeze-alpha CLI flags. FD validation: - Per-particle kernel test (test_dens_alpha_grad.py) on contrived inputs where C > 0 for all particles: rel_err = 1.9e-4 vs centered FD. PASS. - Existing solve_density_constraint backward (rho/grad_C/etc) still passes after kernel signature change (buffer index 12 added for grad_alpha_inv_dt2_per_particle, downstream constants shifted). Practical note for the current demo1 tuning: with rho_rest=1000 the density solver is effectively disabled (computed densities are O(1e-12) in sim-density units, so C = density/rho_rest - 1 ≈ -1 always; one-sided projection skips). In that regime ∂L/∂alpha_dens = 0 — analytic gradient correctly identifies alpha_dens as a non-tunable knob. To make alpha_dens matter for tuning, rho_rest would need to be in the same sim-density units (~1e-13 to 1e-11) where C > 0 triggers the solver. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… parity fix Two parity-narrowing changes for Metal-vs-OpenCL on demo1: 1. ELASTIC FLOOR. Replaced inelastic clamp (x.y = floor_y) with reflective: if (x.y < floor_y): δ = floor_y - x.y x.y = floor_y + e · δ e ∈ [0,1]: 0 = legacy clamp, 1 = perfectly elastic bounce. Forward kernel solve_floor_constraint and the masked variant solve_floor_constraint_with_mask both now take a restitution constant. Backward kernel updates the gradient accordingly: ∂post.y/∂x.y = -e (was 0, killing all y-gradient on clamped) ∂post.y/∂floor_y = 1 + e Threaded through xpbd_step, xpbd_full_fwd, xpbd_full_bwd, and the standalone step_floor_fwd/bwd test ops. 2. FLOOR HEIGHT MATCHING. OpenCL's PCISPH boundary band makes the cube sit on TOP of boundary particles at y≈1.06, not at y=0. Metal's default floor_y=0 puts the cube ~1 unit lower than OpenCL. Setting floor_y=2.0 in dump_metal_trajectory.py raises Metal's cube to match OpenCL's effective floor location (cy=6.64 vs OpenCL 6.68). Parity test results on demo1 0-25ms: K=2220 (Sibernetic default): 4 fails (only init passes) K=5500 + rho fix: 1 fail (cube-extent), mean L2=4.86 K=5500 + floor_y=2.0: 1 fail (cube-extent), mean L2=4.01 17% improvement in mean per-particle L2; max cube-center diff dropped from 7.08 to 5.15. Remaining gap (cube_extent_diff=5.39) is a substrate-level difference NOT closable by parameter tuning: OpenCL's PCISPH performs an axis-asymmetric initial expansion via the density solver in the first 0.4 ms (ext_x=12.30, ext_y=12.36, ext_z=9.40, from initial 10.02). Our XPBD substrate's density solver is inactive at rho_rest=1000 (computed densities are O(1e-12) in sim units → C ≤ 0 always). To replicate OpenCL's puff-up, the density solver would need to operate in matching sim-density units AND remain stable (it explodes at rho_rest=2.5e-13). Future work: separate "soft" boundary repulsion that mimics OpenCL's density-driven cube expansion without enabling full PCISPH dynamics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ble substrate Removes the auxiliary backend implementations (Taichi-Metal/CUDA/CPU and PyTorch-CPU/MPS/CUDA) and consolidates Sibernetic on two paths: the OpenCL gold-standard PCISPH solver and the new native Metal XPBD substrate at src/metal_diff/sib_metal. The PyTorch and Taichi solvers were never reliable past the cube-only elastic case (Taichi pancake bug, PyTorch CPU intractably slow, MPS autodiff coverage gaps); their intended role — differentiable physics for parameter learning — is now served by the native Metal substrate's hand-derived analytic backward kernels. Files removed: taichi_solver.py (1823 lines) pytorch_solver.py (1589 lines) differentiable_solver.py (747 lines) tests/test_pytorch_solver.py tests/test_torch_backend.py tests/test_physics_parity.py (PyTorch ↔ OpenCL parity) tests/test_cpp_integration.py (PyTorch C++ bridge) tests/test_validation.py (PyTorch validation) tests/test_cube_stability.py (cross-backend; only OpenCL/Metal left) tests/conftest.py (PyTorch fixtures only) C++ purge: inc/owConfigProperty.h : useTaichi/taichiDevice/getTaichiDevice/torchEnabled/getTorchDevice inc/owPhysicsFluidSimulator.h : torchSolver/taichiSolver/init*Solver, Python.h include src/owConfigProperty.cpp : backend=taichi*/torch* parsing → opencl-only src/owPhysicsFluidSimulator.cpp : initTorchSolver/initTaichiSolver bodies (-249 lines), simulationStep torch+taichi paths (-119 lines), destructor torch+taichi cleanup, Python.h/numpy includes makefile.OSX : misnamed TORCH_FLAGS → ARCH_DEFINES (it was always just -DOW_NO_OPENCL on Apple Silicon) CI workflow update (.github/workflows/ci-build.yml): build-macos job replaced. Old: build main binary + Taichi-CPU + Taichi-Metal smoke tests. New: build main binary (build-system regression check) + build src/metal_diff/sib_metal + run xpbd_step on demo1 with sanity checks (no NaN/Inf, particles within box). The native Metal substrate is the actual GPU-simulation path on Apple Silicon, so this is what should be exercised in CI. Linux jobs (ubuntu-22.04, ubuntu-24.04): dropped torch/taichi pip installs. Documentation: README.md : full rewrite. New "Two solver paths" overview, PCISPH→XPBD transition explained, dedicated "Differentiable physics with native Metal" section covering what's differentiable, why it's useful, how to use sgd_true.py and the BWD_TBPTT/BWD_CLIP_NORM env vars, and the FD-validation tests. DEVELOPMENT_LOG.md : new top-level "Native Metal substrate" section summarizing M1-M9 work, parity findings (rho_rest unit fix, floor_y=2.0, K plateau, elastic floor, alpha_dens analytic backward), performance numbers. Existing PyTorch/Taichi history retained as design context. src/cuda/README.md : drop "Why not just use Taichi-CUDA" section (the answer is now: we don't have Taichi anymore, period). src/metal_diff/README.md : "Why hand-written" section updated. AGENTS.md : remove backend=torch CLI example. Setup + tests: setup.sh : drop pip install of torch/taichi; install only numpy + pyneuroml. run_all_tests.sh : drop torch/pytorch pytest invocations; run only test_energy.py + test_solver_logs.py. scripts/cross_backend_regression.py : default backend list now [opencl]; docstring updated to reference src/metal_diff/sib_metal as the local-binary parity comparison. Build verification: make all -f makefile.OSX produces ./Release/Sibernetic on Apple Silicon (with OW_NO_OPENCL); ./Release/Sibernetic -h prints usage. src/metal_diff/build.sh produces sib_metal binary; existing FD-validation tests (test_xpbd, test_solve_dens_bwd, test_dens_alpha_grad) still pass. Net: -7,343 lines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The monolithic sib_metal.mm became hard to navigate as the substrate grew. Reorganised into smaller files grouped by op category: metal_common.{h,mm} 208 lines shared utilities: MetalCtx, make_ctx, make_pso, load_shader_source, GridDim3 + GridOrigin3, build_static_grid ops_kernels_m6.mm 515 lines atomic M6 kernel ops (FD-validated building blocks: dist_*, wpoly6, rowsum, density_constraint_grad) ops_xpbd_step.mm 565 lines M7 imperative pipeline (xpbd_step) ops_xpbd_full.mm 1247 lines differentiable pipeline (xpbd_full_fwd + xpbd_full_bwd) — the core of the differentiable substrate; sgd_true.py drives these two ops ops_pair_spring.mm 368 lines pair_forces + spring_bonds, fwd+bwd (FD-validated test ops) ops_test_steps.mm 922 lines step_simple/floor/bond fwd+bwd (kernel-chain test ops) ops_test_density.mm 734 lines density-pipeline unit-test ops (density_as/aa, density_constraint grad bwd, solve_density_constraint fwd+bwd) sib_metal.mm 106 lines slim main + op dispatcher with extern decls of every run_* op Total split content: 4685 lines across 9 files (the +115 over the original 4570 is per-file headers and module-explainer comments). Mechanical refactor: all run_* function bodies are byte-identical to the originals; only `static` qualifiers were stripped where needed for cross-TU linkage, and shared types (MetalCtx, GridDim3, GridOrigin3) moved from .mm to metal_common.h. build.sh updated to compile all .mm files together. Verification: cd src/metal_diff && ./build.sh # produces sib_metal binary python3 test_xpbd.py # PASS (cube integrity) python3 test_solve_dens_bwd.py # PASS (FD-validated rho_rest grad) python3 test_dens_alpha_grad.py # PASS (FD-validated alpha grad) python3 test_learn_floor.py # PASS (SGD converges) End-to-end demo1 cube-drop trajectory matches pre-split output exactly: parity test against OpenCL reference gives identical L2 values (mean=4.01, center_diff=5.15, ext_diff=5.39 — same 4/5 checks pass). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds docs/cube_drop_demo1_25ms.mp4 (880 KB, 60 frames at 15 fps) — the side-by-side OpenCL-vs-Metal comparison on demo1's first 25 ms, with floor_y=2.0 + spring_K=5500 tuned. README links to it as a "Preview" callout near the top so contributors see the substrate behaviour without having to rebuild and run. Three new README sections: 1. **Quickstart: render the demo1 cube drop**. Step-by-step walk-through of how to regenerate the bundled MP4 from scratch on Apple Silicon — build both binaries, run sib_metal on demo1 (1250 steps, 25 ms), fetch OpenCL reference, render side-by-side via render_demo1_parity.py. 2. **Parity status and caveats vs OpenCL**. Promotes the previous single-paragraph status to a structured caveats list: - cube_extent_diff still fails — XPBD's one-sided density constraint doesn't replicate PCISPH's initial cube puff-up - long-timescale floor sliding diverges between substrates after ~100 ms (both physically valid for demo1's microscale) - differentiable backward gradients are biased due to per-step gradient clipping (essential for stability through chaotic dynamics) - membranes not yet ported to Metal substrate (demo2 unsupported) - no worm/c302 bridge in Metal substrate yet Each caveat links to the relevant kernel/file. 3. **Path to a native CUDA backend**. Details the strategic plan to mirror src/metal_diff/ (not the legacy owSolver abstract base) for the eventual NVIDIA-native substrate. Layout sketch, MSL→CUDA translation rules, 5-phase build plan (~2 weeks), reasoning for choosing CUDA over OpenCL on NVIDIA hardware. Also updates src/cuda/README.md to align with the metal_diff-mirror strategy: the previous "mirror PR #222 abstract base" plan has been replaced with "mirror metal_diff" since metal_diff already worked out the differentiable contract (paired forward/backward kernels, shared metal_common.h, thin extern dispatch). The previous file's phase-by-phase implementation plan stays — just the target architecture changed. Net: README +246 lines, src/cuda/README +106 lines, MP4 +880 KB. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GitHub markdown does not auto-render MP4 links inline, but it does render GIFs. Added docs/cube_drop_demo1_25ms.gif (690 KB, 12 fps, 600 px wide, palette-optimized via ffmpeg) so the cube-drop comparison plays inline in the rendered README. The MP4 is preserved as the higher-quality download below. Generated via: ffmpeg -y -i docs/cube_drop_demo1_25ms.mp4 \ -vf "fps=12,scale=600:-1:flags=lanczos,palettegen=stats_mode=full" \ /tmp/palette.png ffmpeg -y -i docs/cube_drop_demo1_25ms.mp4 -i /tmp/palette.png \ -lavfi "fps=12,scale=600:-1:flags=lanczos[x];[x][1:v]paletteuse=dither=bayer:bayer_scale=5" \ docs/cube_drop_demo1_25ms.gif Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n service The runner at ~/Documents/sibernetic-runner/ is slarsontech-private infra and shouldn't be referenced from the public openworm/sibernetic repo. Cleanup: - scripts/cross_backend_regression.py: rewritten as a local-only comparator. Strips the hardcoded URL + API key, drops Cloud Run client logic (submit_cloud_run, poll loop, GCS result fetch). New CLI takes one or more --binary "NAME=PATH" args; first PASS run becomes the comparison baseline. Cube-stability evaluation logic unchanged. - README.md: re-worded the OpenCL-reference fetch step in the "Quickstart: render the demo1 cube drop" section to point at a local Linux OpenCL run rather than the private runner. Performance table no longer mentions specific Cloud Run latencies. - DEVELOPMENT_LOG.md: historical perf entries kept (the L4 numbers came from real measurements), but the specific service ("Cloud Run", "sibernetic-runner", "Cloud Run image") is genericized to "remote L4 GPU" / "remote runner". - src/cuda/README.md: "Why not just use OpenCL on NVIDIA?" section reworded. - src/metal_diff/sgd_tune.py: docstring reference updated. Note: the API key `sibernetic-runner-2026` and runner URL were previously committed to scripts/cross_backend_regression.py and pushed to the public branch. They should be considered exposed and the runner's API key rotated. A history-rewrite commit follows separately to scrub the values from older commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds: docs/demo2_membranes_opencl.mp4 8.7 MB full-res, 1920×1088, 30 fps, 3.5 sec covering 200 ms of sim time on demo2 (4 732 liquid particles dropping onto two side-by-side elastic sheets — left has membranes, right doesn't) docs/demo2_membranes_opencl.gif 1.3 MB 720-px-wide, 10 fps, palette- optimized inline preview The membrane effect is plain: liquid pools on top of the left membrane-clad sheet and passes through the right unmembraned sheet to the floor. This is the canonical visualization of the membrane physics that the Metal substrate doesn't yet model, and a useful reference for when membrane support gets ported. README: new "Preview: demo2 membrane permeability" subsection with the inline GIF, sized parameters, and a forward-link to caveat #4 explaining why this works on OpenCL but not yet on Metal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 1 of the membrane port (parity plan). Stops silently ignoring the two membrane sections in Sibernetic configs: [membranes] -> int32[N_membrane, 3] triangle vertex IDs, remapped from global to active-buffer-local indexing (matches the bonds-buffer convention). Written to <scenario>_membranes.bin. [particleMemIndex] -> int32[n_elastic*7] flat list of incident membrane indices per elastic particle, -1 for unused slots. MAX_MEMBRANES_INCLUDING_SAME_PARTICLE = 7 matches sphFluid.cl:50. Written to <scenario>_pmem_index.bin. info dict now exposes 'n_membranes', paths['membranes'], paths['pmem_index']. Verified counts on demo1 (432 triangles), demo2 (2112), worm_alone_half_resolution (2838). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 2 of the membrane port. Adds three Metal kernels mirroring the OpenCL membrane mechanism in sphFluid.cl:1042-1322: clear_membrane_correction — zero mem_corr accumulator accumulate_membrane_correction — per-liquid: scan elastic neighbors within r0, project onto incident triangles, accumulate Ihmsen-weighted normals, write delta to mem_corr apply_membrane_correction — pos += mem_corr Plus two inlined helpers: calc_det_3x3 and project_to_plane (the 3×3 Cramer's rule projection from sphFluid.cl:1066-1105). The projection's degenerate-triangle case uses |denom| < 1e-9 to skip rather than the OpenCL .w == -1 sentinel — Metal is type-safe so we need an out-param + bool return. ε safeguards at the four flagged sites: - project_to_plane: |denom| < 1e-9 → skip - distance check: r2 < 1e-12 → skip (coincident particles) - normal length: n_len2 < 1e-12 → skip (already on plane) - weight sum: w_sum < 1e-9 → skip (matches sphFluid.cl:1277) Active-buffer ordering convention: elastic [0, n_elastic), liquid [n_elastic, n_active). No separate types buffer needed; the kernel uses gid >= n_elastic to filter to liquid (matches OpenCL's LIQUID_PARTICLE check). Verified: shader library still compiles, existing test_xpbd.py PASSES. Wiring into the pipeline (and FD-validating a backward) come in Phase 3 / Phase 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

backward, SGD-tuned to OpenCL parity Lands the membrane physics needed to run demo2 on the native Metal substrate, plus three adjacent fixes that surfaced during the port: boundary anchor springs (sheets stay aloft instead of crashing to floor under gravity), correct rho_rest in sim-density units (density constraint actually engages at sheet contact), and hard box-wall + velocity clamp (impact-event ejection contained instead of permanent escape). What's new: - M10 membranes: clear / accumulate / apply kernels in shaders.metal ported from sphFluid.cl:1042-1322, with hand-derived analytic backward through plane projection + unit-normal averaging + Ihmsen weighting. FD-validated to <7e-4 relative error in test_membrane_correction.py (single-triangle, shared-edge, multi-liquid). Multi-step backward chain wired into xpbd_full_bwd; test_xpbd_full_membrane.py passes K=1/2/3/5. - M11 boundary anchor springs: load_config.decode_bonds() was silently dropping 670 elastic→boundary connections in demo2's [connection] table because their j index didn't map to active-buffer local. Added separate anchor record format + spring_anchor_force kernel + --anchor-k CLI; without these, sheets fall under gravity and crash to floor in 20 ms. - M12 SPH pressure force kernel as backup; off by default since correct rho_rest makes the existing density constraint provide pressure-like behavior on its own. - M13 box-wall + velocity clamp: impact-event ejection (~1000 OOB particles flying off in random xz directions) reduced to ~0. velocity_clamp at 1.0 m/s prevents CFL violations. - New SGD harness sgd_demo2_permeability.py: finite-diff over dump_metal_trajectory CLI, loss = (y_mem-target)² + (y_por-target)² matching OpenCL t=16ms per-side liquid retention. Dropped loss 562 → 220 (61%) over 13 iterations to K=2697, anchor_k=12177. Adjacent fixes: - dump_metal_trajectory.py: was zeroing initial vel_active; now reads from cfg['vel'][:, :3] (demo2 liquid starts at v_y=-0.027 m/s). - dump_metal_trajectory.py rho_rest default: was 1000 (kg/m³, water); Metal density kernel uses h in particle units giving kg/sim_unit³, so default is now 8e-13 — the constraint only fires when liquid hits elastic neighbors (~10× rest density at sheet contact). - ops_xpbd_step / ops_xpbd_full: membrane apply moved to AFTER update_velocities so it's a position-only soft constraint instead of a 41 750 m/s velocity injection. - ops_xpbd_full xpbd_full_bwd: gradient-explosion diagnosis (not a NaN bug) — TBPTT=3 + log-space clip-norm=10 keeps gradients finite at Sibernetic-scale params; per-step amplification is ~400 000×. Per-side liquid behavior at t=16ms (target = OpenCL): membrane side: y_mean=69 (target 56) — over-retains slightly porous side: y_mean=13 (target 6) — slightly less than full passthrough README updated with new demo2 GIF + chronology of the porting work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tion/one_sprig_test The single-spring oscillation test (1 elastic on 1 spring anchored to a boundary, stationary box) lives at configuration/test/one_sprig_test in the historical layout. Sibernetic's `-f` flag only looks at configuration/<name>, not configuration/test/<name>, so this top-level copy makes it usable directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…muscle drive The smallest elasticity scenario in the suite: one elastic particle on one spring tied to a fixed anchor. Validates that the substrate's spring/anchor machinery reproduces OpenCL period and amplitude exactly (both backends read y = 18.606 at t = 2.000 ms with anchor_K = 555). - shaders.metal: spring_anchor_force kernel extended with three new parameters (muscle_freq, muscle_amp, time_t) for time-varying anchor rest length: L_eff = L_base * (1 + amp * sin(2π * freq * t)). - ops_xpbd_step.mm: per-step host-side time_t buffer with phase continuity carried across chunked invocations via time_offset_s. - dump_metal_trajectory.py: --muscle-freq / --muscle-amp CLI args with cumulative_steps tracking; explicit '0' placeholder protocol when optional CLI slots between vel_clamp and muscle args are skipped. - load_config.py: read 6 bare leading floats as box bounds when [simulation box] header is absent (one_sprig_test format). - render_one_sprig.py: trajectory renderer with auto-detection of OpenCL (frame0 active+boundary, subsequent active-only) vs native Metal (active+boundary every frame) format. - sgd_one_sprig.py: FD-SGD harness over the dump_metal_trajectory CLI; loss = weighted ((period - target_period)^2 + (half_amp - target_amp)^2) on log-space anchor_k. Confirms the kernel converges to anchor_K = 555. - README.md: preview section paralleling demo1 / demo2 with side-by-side GIF.

Replace the three Preview blocks at the top of README with a single 2x2 combined GIF (sibernetic_demos_combined.gif) plus a one-line description per demo and a pointer to the new docs/demos.md. Roll the prior demo1-specific Quickstart into a single, demo-agnostic "Quickstart: run a demo comparison" section that points readers to the per-demo recipes in docs/demos.md. - docs/demos.md (new): per-demo writeups for demo1, demo2, one_sprig_test, and worm_alone_half_resolution. Each includes parity numbers, the SGD-tuned Metal parameters, and a "Regenerate locally" block with the exact commands. - docs/sibernetic_demos_combined.gif (new): 4-second 12 fps loop combining the four demo MP4s as a 2x2 grid (1000x1000), built with a two-pass palette pipeline. - docs/worm_alone_opencl_vs_metal.mp4 (new): the 4-panel worm_alone comparison MP4 (top row iso, bottom row closeup, left OpenCL, right native Metal). - src/metal_diff/render_worm.py (new): worm-config trajectory renderer with auto-format detection; supports iso, closeup, and cross views; CLI knobs for zoom, z-focus-frac, z-extent-frac, show-edges, focus-on-worm. - src/metal_diff/sgd_worm.py (new): FD-SGD harness over the Metal trajectory CLI for worm configs, with per-frame elastic-only L2 loss against an OpenCL reference trajectory.

Worm immersed in 80 929-particle liquid bath — the next-up scale from worm_alone (which sat on dry boundary). Worm shape preserved within 2 % over 25 ms, mean per-particle position drift ≤ 3.1 sim units by end of trajectory (closer earlier: Δ = 0.5 at 1 ms, 0.6 at 5 ms, 0.8 at 10 ms). Buoyancy comes from XPBD's density-constraint solver firing when local density exceeds rho_rest = 4e-13 (the natural water density in our units, kg/sim_unit³). Two settings dominate: - rho_rest = 4e-13 (was 8e-13, above natural density — solver never fired). Below natural density makes the constraint persistently active. - alpha_dens = 1e3 (was 1e-3, too soft to compete with the mass/rho² term in the XPBD denominator). 1e3 makes the per-iteration position correction a meaningful fraction of the density excess. Substrate-level work (M14 — Ihmsen 2010 boundary handling): - shaders.metal — new kernel boundary_position_correction. Sums weighted boundary normals from sorted_static neighbors, projects active particles along the resulting direction by w2_sum/w_sum. Dispatched both inside the inner XPBD loop (3× per step) and after update_velocities to keep velocity changes bounded. - metal_common.{h,mm} — build_static_grid extended to optionally sort a parallel aux array (boundary normals) using the same permutation it applies to positions. Caller frees both. - load_config.py — extracts boundary "velocities" (which are inward-pointing surface normals per Sibernetic config convention, sphFluid.cl:791) and writes them un-sorted; the C++ build_static_grid sorts them alongside positions in one pass. - ops_xpbd_step.mm — new args bound_r0, bound_gain, path_bound_normals (argv[42..44]); seeds bufDens to rho_rest when use_pressure is on (was only firing when use_pair_forces was on, leaving pressure reading uninitialized memory and NaN-ing). M14 is in place but is NOT what fixes worm_swim — the buoyancy fix is the rho_rest/alpha_dens combination above. M14 retroactively improves worm_alone (replaces its floor_y=2.25 hack with real Ihmsen normals) and is the substrate-level groundwork for any scenario where boundary repulsion matters. Argv encoding fix in dump_metal_trajectory.py: vel_clamp + box_clamp slot reservation now expands to also reserve when M14 args follow. SGD harness extensions (sgd_worm.py): linear-space updates for bounded params (floor_y), oscillation-detect-and-damp, alpha_dens trainable. Skill update (~/.claude/skills/sibernetic-demo-port/SKILL.md): documents that build-what's-needed is the rule, not "ship with caveats and ask permission to keep working".

Extended the implementation to support Metal alongside OpenCL.

- docs/worm_alone_opencl_vs_metal.gif: 1.4 MB, 12 fps, 600px wide. worm_alone has had an MP4 in docs/ since the original port but no embeddable preview; demos.md now displays it inline at the top of the section (matching demo1 / demo2 / one_sprig_test convention). - docs/worm_swim_opencl_vs_metal.gif: 3.6 MB, 8 fps, 400px wide. worm_swim is bigger because the 80 929 swirling water particles produce huge inter-frame variation; even at the tightest dither and 64 colors a smaller-than-this gif wasn't readable. Embedded in demos.md. - docs/sibernetic_demos_combined.gif: rebuilt as a 5-panel mosaic (was 4). Top row: demo1 / demo2 / one_sprig_test. Bottom row: worm_alone / [white spacer] / worm_swim. Same 4-second loop. Also bumped alt text in README.md to mention all 5. - .github/workflows/ci-build.yml: pip-install numpy on the macOS runner before the Native-Metal smoke test. The smoke test imports load_config which `import numpy as np`, but macos-latest's system python3 doesn't ship numpy by default — caused every push since the worm_alone port to fail this step. Use --break-system-packages because Python 3.13+ on the runner treats the system interpreter as PEP-668 externally-managed.

load_to_metal_buffers() looks up `configuration/<scenario>` relative to cwd. The previous fix to the macOS smoke test (install numpy) revealed this second issue: cd-ing into src/metal_diff/ before the python heredoc broke the configuration/ lookup. Pre-locate the binary as $SIB_METAL, keep cwd at repo root, and reference both via env.

slarson and others added 30 commits June 28, 2025 19:29

Limit energy test iterations

633bb44

Adjust OpenCL build flags and shorten energy test

a365ea6

Enable configurable iterations for test mode

9e395d0

Merge pull request #192 from openworm/codex/create-unit-tests-for-eng…

934cb84

…ine-validation Adjust energy test iterations and add reference logs

Update makefile

6158beb

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'ow-0.9.8' into 75pgw2-codex/create-unit-tests-for-engin…

b00bc04

…e-validation

Merge pull request #193 from openworm/x2nqc4-codex/create-unit-tests-…

c98b21f

…for-engine-validation Limit energy test to 50 iterations

Merge pull request #194 from openworm/75pgw2-codex/create-unit-tests-…

e3df14a

…for-engine-validation Allow shorter test runs

Add repository guide and module docstrings

4832ac6

Merge pull request #195 from openworm/codex/generate-agents.md-and-im…

9a1f2b9

…prove-documentation Add repository guide and docstrings

Make tests self-contained with OpenCL

c92ba16

Merge pull request #196 from openworm/codex/enable-opencl-installatio…

ec6d268

…n-for-tests Ensure OpenCL tests run

Fix OpenCL solver destructor

ea06f5d

Merge pull request #197 from openworm/codex/add-unit-tests-for-opencl…

9b297ce

…-solver Fix solver destructor segfault

Remove duplicate import

2b72003

Merge pull request #198 from openworm/codex/add-unit-tests-for-opencl…

151d171

…-solver-with-logs Add reference solver logs for tests

Add setup script and update docs

9522b6f

Update pytorch_solver.py

6688d18

performance optimization: loops vs vectorization Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update pytorch_solver.py

0febf02

More performance optimization by vectorizing Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge pull request #199 from openworm/codex/create-pytorch_solver.py-…

24931bf

…and-implement-tensor-operations Add PyTorch solver, unit tests, and installation scripts / instructions.

Enable OpenCL CPU driver and subprocess fallback

1c4e35f

Merge pull request #200 from openworm/codex/investigate-worm_motion-l…

7c016f1

…og-file-issue Ensure OpenCL runtime and subprocess fallback

Add optional PyTorch solver backend

bf873a6

Merge pull request #201 from openworm/codex/integrate-pytorch_solver-…

c0024e4

…with-pybind11 Enable optional PyTorch solver

Update ci-build.yml

e7fcec8

add install for pytest

Handle header lines in log loader

5c472f0

Merge branch 'ow-pytorch-0.0.1' into codex/update-readme-and-agents.m…

c37e5e8

…d-for-pytorch-solver

Fix torch backend velocity column

6968339

Merge pull request #208 from openworm/codex/fix-test-failure-in-test_…

07dc1ec

…torch_backend Fix torch backend velocity output

test: relax torch backend tolerance

0e8834b

slarson and others added 30 commits May 4, 2026 02:40

Add Metal support to the implementation

161d60a

Extended the implementation to support Metal alongside OpenCL.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ow native gpu 0.1.0 -> 0.9.9#230

Ow native gpu 0.1.0 -> 0.9.9#230
pgleeson wants to merge 124 commits into
ow-test-mergefrom
ow-native-gpu-0.1.0

pgleeson commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pgleeson commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants