Skip to content

Ow native gpu 0.1.0 -> 0.9.9#230

Open
pgleeson wants to merge 124 commits into
ow-test-mergefrom
ow-native-gpu-0.1.0
Open

Ow native gpu 0.1.0 -> 0.9.9#230
pgleeson wants to merge 124 commits into
ow-test-mergefrom
ow-native-gpu-0.1.0

Conversation

@pgleeson
Copy link
Copy Markdown
Member

No description provided.

slarson and others added 30 commits June 28, 2025 19:29
…ine-validation

Adjust energy test iterations and add reference logs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…for-engine-validation

Limit energy test to 50 iterations
…for-engine-validation

Allow shorter test runs
…prove-documentation

Add repository guide and docstrings
…-solver-with-logs

Add reference solver logs for tests
performance optimization: loops vs vectorization

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
More performance optimization by vectorizing

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…and-implement-tensor-operations

Add PyTorch solver, unit tests, and installation scripts / instructions.
…og-file-issue

Ensure OpenCL runtime and subprocess fallback
…with-pybind11

Enable optional PyTorch solver
add install for pytest
…torch_backend

Fix torch backend velocity output
slarson and others added 30 commits May 4, 2026 02:40
…ll_bwd

Adds param-gradient kernels and host accumulators so SGD over fluid
params no longer requires finite differences (which were noisy in
the chaotic snapshot-loss regime).

New kernels:
- spring_K_partial: per-particle scalar = <∂(spring_accel_i)/∂K, ga_i>
- visc_K_partial:   per-particle scalar = <∂(visc_accel_i)/∂visc_K, ga_i>

Both read bGext (= dt · dL/dv_after_apply) and per-particle inputs
(positions, velocities, density). Host sums per-particle outputs into
total_grad_spring_K / total_grad_visc_K accumulators. xpbd_full_bwd
now writes these as scalar files at optional argv[21], argv[22].

Validation (test_param_grads.py): toy 6-particle chain.
- spring_K analytic vs FD: rel err 7.8e-4 ✓ PASS
- visc_K analytic vs FD: rel err ~17% (FAIL the 10% threshold)

The visc_K rel err is the **stop-gradient through density** missing
chain — visc force scales as 1/ρ_i, and ρ_i depends indirectly on
visc_pair_coef through positions across timesteps. Treating ρ as
stop-gradient captures the dominant direct-dependence path; the
implicit density-perturbation path adds the remaining ~17%. For
SGD purposes, having gradient correct to ~80% still points in the
right direction.

Also adds:
- sgd_tune.py — original FD-based SGD harness (useful baseline)
- random_search_tune.py — random-search fallback for chaos basins
- sgd_analytic.py — wrapper that uses production xpbd_step for fast
  forward eval, FD for params (analytic chain via xpbd_full_bwd is
  available but xpbd_full at K=50000 is slow due to dense N²)

Empirical findings during exploration:
- The cube-fall snapshot at t=1s is chaotic — small param changes
  can shift bounce phase, making finite-diff gradients noisy
- Random search found best L=4.3e-3 (5.3% Δm_y err, 3.9% extent err)
- SGD from baseline diverged at lr=0.05 (5% step crossed chaos cliff)
- True path to <1% requires either:
  (a) smoother loss formulation (time-averaged or settled-state metrics)
  (b) running longer until cube settles past bouncing
  Both are research directions, not bug fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed BPTT

Three changes wiring up the analytic-gradient SGD pipeline:

1. Floor constraint (solve_floor_constraint_with_mask + backward) wired
   into xpbd_full_fwd/bwd. New optional [floor_y] arg matches xpbd_step's
   floor handling so gradient flow goes through the same physics.
   - per-step state extended by 1*n_active int32 (clamped mask)
   - forward saves mask, backward restores + dispatches floor_bw

2. predict_positions_backward bug fix: was missing sim_scale_inv factor
   on velocity gradient. Forward is `pos += v · dt · sim_scale_inv`;
   backward `dL/dv = grad_x_pred · dt · sim_scale_inv`. Old code did
   `dL/dv = grad_x_pred · dt` only — under-counted velocity gradient by
   135135x when sim_scale=7.4e-6 (Sibernetic units). All 4 callers
   updated to pass sim_scale_inv buffer (toy ops pass 1.0).

3. Truncated BPTT support: env var BWD_TBPTT=N walks back only the
   last N steps in xpbd_full_bwd's reverse loop. Default = K (no
   truncation, current behavior).

   Empirically observed: gradient through chaotic xpbd_full backward
   amplifies ~3× per step. K=10 gives g_rho ~1e7, K=200 gives ~1e25,
   K=500 overflows fp32 to NaN. This is FUNDAMENTAL to backprop through
   chaotic dynamics — not a bug.

   Standard ML technique for chaotic systems is TBPTT (truncated
   backprop through time). Now available as `BWD_TBPTT=N` env var.
   Combine with gradient clipping in the SGD harness for stability.

All existing regression tests still pass:
- test_xpbd_full (∂L/∂ρ rel err 3e-5)
- test_xpbd_full_visc (with pair_forces)
- test_xpbd_full_spring (with springs)
- test_param_grads (spring_K rel err 7.8e-4; visc_K 17% off due to
  stop-gradient through density)

For SGD on the demo1 cube-fall:
- Forward + backward cost: K=50000 → 60s + 480s = ~9 min/iter
- TBPTT=N reduces backward to N/K of full cost
- Gradient explosion still requires careful loss formulation or
  TBPTT + clipping; demo1's bouncing-cube snapshot loss is at the
  edge of what's tractable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backprop through 50000 steps of chaotic xpbd dynamics overflows fp32
(gradient amplifies ~1000× per step in this regime). Add:

- --tbptt N: env-var driven truncated BPTT window (xpbd_full_bwd
  walks only last N steps in reverse loop)
- --clip-norm: clip global L2 norm of log-space gradient
- NaN/Inf filter: replace non-finite components with 0 to avoid
  parameter blow-up if TBPTT N is still too large

Empirical regime for demo1 cube fall:
- TBPTT=3-5: gradients finite, clipped from ~1e8 (TBPTT=3) to
  ~1e18 (TBPTT=5). Direction preserved by clipping.
- TBPTT=8+: NaN — fp32 overflow during backward.

Tradeoff: TBPTT=5 only credits parameter influence on the LAST 5
steps of dynamics (sim_time = 100 µs of the 1.0 s run). Convergence
will be slow because parameter→trajectory effects further back are
invisible. Mitigates the chaos, doesn't escape it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end SGD using xpbd_full_fwd + xpbd_full_bwd analytic chain
(no FD, no random search). Captures:
- 4-part substrate work (param grads, floor, predict_bw bug fix, TBPTT)
- 12-step SGD trajectory: loss dropped 50× (0.147 → 0.0023)
- Best simultaneous metrics: Δm 2.88%, ext 3.86% at step 10
- Best ext alone: 1.91% at step 11
- Pareto frontier observed; no single point under 1% on both
- Honest discussion of chaos-driven gradient explosion + TBPTT remedy

Substrate has correct analytic gradients now. Reaching sub-1% on both
metrics simultaneously requires loss reformulation (settled-state,
time-averaged), which is the natural follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three SGD runs total (12 + 15 + 10 iters):
- Loss reduced 140× from start (0.147 → 0.0011)
- Extent metric reached 0.01% (essentially perfect on this metric)
- Δm settled around 3-3.3% — chaos prevents further reduction
- Pareto trade-off persists; no point under 1% on BOTH simultaneously

Surprise finding: SGD-optimal params for xpbd_full DON'T transfer to
xpbd_step. Same algorithm, different fused-vs-unfused fp32 rounding,
different bounce phase at t=1s under chaos. Documented honestly with
three concrete next-step paths for cross-substrate transfer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original demo1 puts the cube at box geometric center (44.42, 44.42,
44.42) but the boundary particles are not at perfect bounding-box
center — the inner-face geometric center is at (43.59, 43.59, 43.59).
The 0.83-unit offset creates an asymmetric pressure field that nudges
the cube horizontally, so it slides across the floor as it falls
instead of dropping straight down.

This config shifts the cube (elastic + liquid) by (-0.83, 0, -0.83) so
the cube starts at (43.59, 44.42, 43.59) — directly above the inner
geometric center. With this fix the cube falls straight down without
sliding, and visual comparisons across substrates show clean cube-fall
behavior instead of chaotic horizontal drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User caught visually that demo1's cube was sliding across the floor
instead of falling straight down — a quirk of the boundary particles
being asymmetric (cube at box geometric center is NOT at boundary
inner geometric center; off by 0.83 units). Lateral pressure gradient
nudges the cube horizontally, then chaos amplifies into "hockey-puck-
on-ice" behavior that LOOKED clearly broken on visual inspection but
was passing all our existing numerical checks.

Three things in this commit:

1. configuration/demo1_centered — cube shifted by (-0.83, 0, -0.83)
   so it starts at boundary geo center (43.59, 44.42, 43.59). With
   the cube symmetrically placed, it falls straight down under gravity.

2. tests/test_cube_fall_trajectory.py — codifies the "looks correct"
   visual check as automated trajectory-equivalence assertions:
     - Initial cube position matches expected (within 1 unit)
     - Horizontal drift over the trajectory < 5 units (no slide)
     - Final cube min_y near floor (lands)
     - Y-extent retention in [0.5, 2.0] (no pancake, no explode)
     - Cube center y stable in last 0.5s (settled, not still falling)

   Runs as `python3 tests/test_cube_fall_trajectory.py path/to/position_buffer.txt`.
   Reads either Sibernetic's variable-frame format (boundary only in
   frame 0) or our normalized fixed-frame format. Same pass/fail
   criteria for both — a substrate is equivalent to the gold standard
   if and only if the trajectory check passes.

   Validated against three datasets:
   - Metal substrate, no springs, no pair_forces (centered cfg) → PASS
     (drift 1.34, extent retention 0.96)
   - Metal substrate, no springs, WITH pair_forces (centered cfg) → FAIL
     (drift 52.7, caught the slide that visual inspection found)
   - Sibernetic local Taichi-Metal (centered cfg) → FAIL
     (extent retention 0.00, caught the known Taichi pancake bug)

3. scripts/render_cube_fall_compare.py — side-by-side MP4 renderer.
   Reads two position_buffer.txt files, samples N matched sim times,
   renders 3D scatter plots with fixed box-bounds camera so visual
   comparison shows the actual cube trajectory (not autocamera-zoomed
   onto whatever's left).

   `python3 scripts/render_cube_fall_compare.py out.mp4 left.txt right.txt "Left Label" "Right Label"`

The comparison movie at ~/Downloads/cube_drop_compare.mp4 shows
Sibernetic Taichi (pancakes) vs Metal Native (intact) side by side.
Both are running on demo1_centered so neither slides — only their
intact-vs-pancake behavior differs.

Note on cross-substrate scope: Sibernetic OpenCL gold standard requires
Cloud Run; the baked image doesn't have demo1_centered yet (would need
a redeploy with the new config baked in). For now this comparison uses
Sibernetic local Taichi-Metal which has the known elastic-coordinate-
scale bug. Once Cloud Run is redeployed with demo1_centered, swap the
LEFT panel for OpenCL output for the gold-standard reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Course-correction: the right regression test for substrate equivalence is
"Metal-on-demo1 vs OpenCL-on-demo1 within tolerance," NOT "cube falls
straight under idealized physics." demo1's boundary is asymmetric
(3459 -x face vs 3671 +x face); both backends slide as a result. They
should slide the *same way* — that's equivalence.

Removes:
  configuration/demo1_centered      (engineered around demo1's quirks)
  tests/test_cube_fall_trajectory.py (asserted idealized physics)

Adds:
  tests/test_demo1_backend_parity.py — diffs two trajectories from same
    config: per-particle L2, cube-center diff, cube-extent diff, init
    + final agreement. Reads optional .times.txt sidecar for variable
    chunk sizes.
  scripts/render_demo1_parity.py — pyvista shaded-sphere renderer with
    quad time-warped sampling so the 8ms physical descent fills 30+ video
    frames (no teleport).
  src/metal_diff/dump_metal_trajectory.py — variable-chunk schedule
    (--dense-steps + --dense-chunk for descent, --chunk for settle) and
    per-frame .times.txt sidecar.

First parity run on demo1 dense (logstep=20): mean per-particle L2=53,
max cube-center diff=96 — substrates have NOT converged. That's the
honest baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default was 2.5e-13 (toy-test convention), but Sibernetic configs use
1000 kg/m³ (water density, per inc/owPhysicsConstant.h: const float
rho0 = 1000.0f). With wrong rho_rest the density-constraint solver
massively overcorrects: cube fell 39 sim units in 2ms (vs OpenCL's
7ms), then bounced violently to the top of the box and oscillated.

With rho_rest=1000 (correct value):
  - cube falls under gravity at the right rate (8.4ms to floor)
  - lands with normal squash (extent_y 10 → 4.8)
  - settles at cy=1.47, drifts only 0.135 sim units over 25ms

Parity test on dense data:
  before: mean per-particle L2 = 53.8, cube-center diff = 95.7
  after:  mean per-particle L2 =  6.2, cube-center diff = 10.4

Still fails strict thresholds (Metal compresses more, lands ~1ms early)
but the substrate is now in the right ballpark.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lags

CLI overrides for trajectory targets (so we can match OpenCL at 25ms
instead of the hardcoded long-trajectory targets) and per-parameter
freeze flags so we can tune just spring_K without touching rho or v.

Tuning result for demo1 25ms cube fall (OpenCL parity):
  K=2220 (Sibernetic default): ext_y=4.79 at landing (over-compressed)
  K=5500 (best): ext_y=9.46 at landing
  K≥9500: NaN divergence (XPBD constraint solver unstable)

Beyond K=5500 the cube extent saturates at ~9.46 — substrate-level
limit. Remaining gap to OpenCL's ext_y≈11.2 is XPBD vs PCISPH on-impact
behavior; not closable by spring stiffness alone.

Parity test on demo1 0-25ms with K=5500:
  before tune (K=2220): center_diff=10.4, ext_diff=7.6
  after tune  (K=5500): center_diff= 7.1, ext_diff=5.1
  4/5 parity checks now pass; only ext_diff above threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When inspecting short timescales (e.g. 0-25ms cube descent) time-warped
sampling creates the illusion of "skating" because samples jump from
e.g. 12ms to 86ms to 759ms. Uniform mode samples evenly across [0,t-max]
so motion appears at correct speed.

Also --hide-liquid filters type-1 liquid particles from the render —
demo1 is elastic shell + 125 liquid particles inside; when liquid leaks
out at impact, it visually obscures the cube.

Use:
  --uniform                 # disable quad time-warping
  --t-max 0.025             # show only first 25ms
  --hide-liquid             # only render the elastic cube particles

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pping

Two related fixes that together make xpbd_full_bwd return stable finite
gradients on demo1, where it previously produced NaN within 5 chain
steps regardless of TBPTT setting.

Fix 1 (kernel): density_constraint_grad_backward divided by `r` in
g_hat = dir/r and h_minus_r/r factors with no epsilon. Particles in
contact (r → 0 post-impact) produced inf/NaN. Replaced with r_safe =
max(r, R_MIN) where R_MIN = 1e-4 sim units, and skip pairs with
r2 < R_MIN². Other backward kernels already had similar guards.

Fix 2 (host): xpbd_full_bwd chain rule amplifies position gradients
by ~sim_scale_inv (1.35e5) per step because update_vel backward
divides by dt and predict backward multiplies by dt·sim_scale_inv.
With Sibernetic's mass=2e-12, gradients overflow to NaN within 5–10
chain steps — fundamental property of differentiable chaos with
microscale parameters, not a bug. Added BWD_CLIP_NORM env var: per
chain step, zero NaN/Inf and L2-clip running grad_x and grad_v to
the cap. Yields biased-but-bounded gradients usable by Adam in
log-space.

Verification (TBPTT scan on demo1 1250 steps):

  Before:
    TBPTT=1:    g_K=-3e-9   g_v=NaN   gx0_NaN=414/1029
    TBPTT=5:    g_K=NaN     g_v=NaN   gx0_NaN=1029/1029

  After (BWD_CLIP_NORM=1e3):
    TBPTT=1:    g_K=-3e-9   g_v=+2.4e-4   gx0_NaN=0/1029
    TBPTT=100:  g_K=-1.4e3  g_v=-1.5e8    gx0_NaN=0/1029
    TBPTT=1250: g_K=-5.1e4  g_v=+9.7e10   gx0_NaN=0/1029

SGD on K + v with the fixed backward (TBPTT=100, lr=0.05, BWD_CLIP_NORM=1e3):
  step 0:  L=0.0354 K=5500
  step 9:  L=0.0339 K=8626 v=5.12e-5  (best, 4% loss reduction)
  step 11: K crossed stability boundary (≥9500 → forward NaN)

SGD confirms manual sweep finding: optimum sits on the K-saturation
plateau (5500-8625), right at the stability cliff. Remaining 16%
extent gap to OpenCL is algorithmic (XPBD vs PCISPH on impact),
not closable by spring tuning alone.

Also adds src/metal_diff/sgd_fd.py — finite-difference SGD as a
fallback for parameters not exposed in xpbd_full_bwd analytic
gradients.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends solve_density_constraint_backward to also compute ∂L/∂A where
A = alpha_inv_dt2 (= alpha_dens / dt²). Math: with Δλ = -(C + A·λ_pre)/D,
the derivative simplifies to

    ∂Δλ/∂A = -(λ_pre + Δλ) / D = -λ_post / D

so per-particle gradient = chain · (-λ_post/D), where chain = (ω·g)/m + ψ.
Host sums per-particle contributions, divides by dt², writes to argv[24]
of xpbd_full_bwd. sgd_true.py now learns alpha_dens via Adam alongside
rho/K/v with --alpha-init / --freeze-alpha CLI flags.

FD validation:
  - Per-particle kernel test (test_dens_alpha_grad.py) on contrived inputs
    where C > 0 for all particles: rel_err = 1.9e-4 vs centered FD. PASS.
  - Existing solve_density_constraint backward (rho/grad_C/etc) still
    passes after kernel signature change (buffer index 12 added for
    grad_alpha_inv_dt2_per_particle, downstream constants shifted).

Practical note for the current demo1 tuning: with rho_rest=1000 the
density solver is effectively disabled (computed densities are O(1e-12)
in sim-density units, so C = density/rho_rest - 1 ≈ -1 always; one-sided
projection skips). In that regime ∂L/∂alpha_dens = 0 — analytic gradient
correctly identifies alpha_dens as a non-tunable knob. To make alpha_dens
matter for tuning, rho_rest would need to be in the same sim-density
units (~1e-13 to 1e-11) where C > 0 triggers the solver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… parity fix

Two parity-narrowing changes for Metal-vs-OpenCL on demo1:

1. ELASTIC FLOOR. Replaced inelastic clamp (x.y = floor_y) with reflective:
     if (x.y < floor_y):
       δ = floor_y - x.y
       x.y = floor_y + e · δ
   e ∈ [0,1]: 0 = legacy clamp, 1 = perfectly elastic bounce.
   Forward kernel solve_floor_constraint and the masked variant
   solve_floor_constraint_with_mask both now take a restitution
   constant. Backward kernel updates the gradient accordingly:
     ∂post.y/∂x.y = -e   (was 0, killing all y-gradient on clamped)
     ∂post.y/∂floor_y = 1 + e
   Threaded through xpbd_step, xpbd_full_fwd, xpbd_full_bwd, and the
   standalone step_floor_fwd/bwd test ops.

2. FLOOR HEIGHT MATCHING. OpenCL's PCISPH boundary band makes the cube
   sit on TOP of boundary particles at y≈1.06, not at y=0. Metal's
   default floor_y=0 puts the cube ~1 unit lower than OpenCL. Setting
   floor_y=2.0 in dump_metal_trajectory.py raises Metal's cube to
   match OpenCL's effective floor location (cy=6.64 vs OpenCL 6.68).

Parity test results on demo1 0-25ms:

  K=2220 (Sibernetic default):       4 fails (only init passes)
  K=5500 + rho fix:                  1 fail (cube-extent), mean L2=4.86
  K=5500 + floor_y=2.0:              1 fail (cube-extent), mean L2=4.01

  17% improvement in mean per-particle L2; max cube-center diff dropped
  from 7.08 to 5.15.

Remaining gap (cube_extent_diff=5.39) is a substrate-level difference
NOT closable by parameter tuning: OpenCL's PCISPH performs an
axis-asymmetric initial expansion via the density solver in the first
0.4 ms (ext_x=12.30, ext_y=12.36, ext_z=9.40, from initial 10.02). Our
XPBD substrate's density solver is inactive at rho_rest=1000 (computed
densities are O(1e-12) in sim units → C ≤ 0 always). To replicate
OpenCL's puff-up, the density solver would need to operate in matching
sim-density units AND remain stable (it explodes at rho_rest=2.5e-13).
Future work: separate "soft" boundary repulsion that mimics OpenCL's
density-driven cube expansion without enabling full PCISPH dynamics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ble substrate

Removes the auxiliary backend implementations (Taichi-Metal/CUDA/CPU and
PyTorch-CPU/MPS/CUDA) and consolidates Sibernetic on two paths: the
OpenCL gold-standard PCISPH solver and the new native Metal XPBD
substrate at src/metal_diff/sib_metal. The PyTorch and Taichi solvers
were never reliable past the cube-only elastic case (Taichi pancake
bug, PyTorch CPU intractably slow, MPS autodiff coverage gaps); their
intended role — differentiable physics for parameter learning — is now
served by the native Metal substrate's hand-derived analytic backward
kernels.

Files removed:
  taichi_solver.py                  (1823 lines)
  pytorch_solver.py                 (1589 lines)
  differentiable_solver.py          (747 lines)
  tests/test_pytorch_solver.py
  tests/test_torch_backend.py
  tests/test_physics_parity.py      (PyTorch ↔ OpenCL parity)
  tests/test_cpp_integration.py     (PyTorch C++ bridge)
  tests/test_validation.py          (PyTorch validation)
  tests/test_cube_stability.py      (cross-backend; only OpenCL/Metal left)
  tests/conftest.py                 (PyTorch fixtures only)

C++ purge:
  inc/owConfigProperty.h            : useTaichi/taichiDevice/getTaichiDevice/torchEnabled/getTorchDevice
  inc/owPhysicsFluidSimulator.h     : torchSolver/taichiSolver/init*Solver, Python.h include
  src/owConfigProperty.cpp          : backend=taichi*/torch* parsing → opencl-only
  src/owPhysicsFluidSimulator.cpp   : initTorchSolver/initTaichiSolver bodies (-249 lines),
                                      simulationStep torch+taichi paths (-119 lines),
                                      destructor torch+taichi cleanup, Python.h/numpy includes
  makefile.OSX                      : misnamed TORCH_FLAGS → ARCH_DEFINES (it was always
                                      just -DOW_NO_OPENCL on Apple Silicon)

CI workflow update (.github/workflows/ci-build.yml):
  build-macos job replaced. Old: build main binary + Taichi-CPU + Taichi-Metal
  smoke tests. New: build main binary (build-system regression check) + build
  src/metal_diff/sib_metal + run xpbd_step on demo1 with sanity checks
  (no NaN/Inf, particles within box). The native Metal substrate is the
  actual GPU-simulation path on Apple Silicon, so this is what should be
  exercised in CI.
  Linux jobs (ubuntu-22.04, ubuntu-24.04): dropped torch/taichi pip installs.

Documentation:
  README.md                  : full rewrite. New "Two solver paths" overview,
                               PCISPH→XPBD transition explained, dedicated
                               "Differentiable physics with native Metal"
                               section covering what's differentiable, why
                               it's useful, how to use sgd_true.py and the
                               BWD_TBPTT/BWD_CLIP_NORM env vars, and the
                               FD-validation tests.
  DEVELOPMENT_LOG.md         : new top-level "Native Metal substrate"
                               section summarizing M1-M9 work, parity findings
                               (rho_rest unit fix, floor_y=2.0, K plateau,
                               elastic floor, alpha_dens analytic backward),
                               performance numbers. Existing PyTorch/Taichi
                               history retained as design context.
  src/cuda/README.md         : drop "Why not just use Taichi-CUDA" section
                               (the answer is now: we don't have Taichi
                               anymore, period).
  src/metal_diff/README.md   : "Why hand-written" section updated.
  AGENTS.md                  : remove backend=torch CLI example.

Setup + tests:
  setup.sh                   : drop pip install of torch/taichi; install only
                               numpy + pyneuroml.
  run_all_tests.sh           : drop torch/pytorch pytest invocations; run
                               only test_energy.py + test_solver_logs.py.
  scripts/cross_backend_regression.py : default backend list now [opencl];
                                         docstring updated to reference
                                         src/metal_diff/sib_metal as the
                                         local-binary parity comparison.

Build verification: make all -f makefile.OSX produces ./Release/Sibernetic
on Apple Silicon (with OW_NO_OPENCL); ./Release/Sibernetic -h prints usage.
src/metal_diff/build.sh produces sib_metal binary; existing FD-validation
tests (test_xpbd, test_solve_dens_bwd, test_dens_alpha_grad) still pass.

Net: -7,343 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The monolithic sib_metal.mm became hard to navigate as the substrate
grew. Reorganised into smaller files grouped by op category:

  metal_common.{h,mm}    208 lines  shared utilities: MetalCtx, make_ctx,
                                    make_pso, load_shader_source, GridDim3
                                    + GridOrigin3, build_static_grid
  ops_kernels_m6.mm      515 lines  atomic M6 kernel ops (FD-validated
                                    building blocks: dist_*, wpoly6,
                                    rowsum, density_constraint_grad)
  ops_xpbd_step.mm       565 lines  M7 imperative pipeline (xpbd_step)
  ops_xpbd_full.mm      1247 lines  differentiable pipeline (xpbd_full_fwd
                                    + xpbd_full_bwd) — the core of the
                                    differentiable substrate; sgd_true.py
                                    drives these two ops
  ops_pair_spring.mm     368 lines  pair_forces + spring_bonds, fwd+bwd
                                    (FD-validated test ops)
  ops_test_steps.mm      922 lines  step_simple/floor/bond fwd+bwd
                                    (kernel-chain test ops)
  ops_test_density.mm    734 lines  density-pipeline unit-test ops
                                    (density_as/aa, density_constraint
                                    grad bwd, solve_density_constraint
                                    fwd+bwd)
  sib_metal.mm           106 lines  slim main + op dispatcher with
                                    extern decls of every run_* op

Total split content: 4685 lines across 9 files (the +115 over the
original 4570 is per-file headers and module-explainer comments).

Mechanical refactor: all run_* function bodies are byte-identical to
the originals; only `static` qualifiers were stripped where needed for
cross-TU linkage, and shared types (MetalCtx, GridDim3, GridOrigin3)
moved from .mm to metal_common.h.

build.sh updated to compile all .mm files together. Verification:

  cd src/metal_diff && ./build.sh    # produces sib_metal binary
  python3 test_xpbd.py               # PASS (cube integrity)
  python3 test_solve_dens_bwd.py     # PASS (FD-validated rho_rest grad)
  python3 test_dens_alpha_grad.py    # PASS (FD-validated alpha grad)
  python3 test_learn_floor.py        # PASS (SGD converges)

End-to-end demo1 cube-drop trajectory matches pre-split output exactly:
parity test against OpenCL reference gives identical L2 values
(mean=4.01, center_diff=5.15, ext_diff=5.39 — same 4/5 checks pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds docs/cube_drop_demo1_25ms.mp4 (880 KB, 60 frames at 15 fps) — the
side-by-side OpenCL-vs-Metal comparison on demo1's first 25 ms, with
floor_y=2.0 + spring_K=5500 tuned. README links to it as a "Preview"
callout near the top so contributors see the substrate behaviour without
having to rebuild and run.

Three new README sections:

1. **Quickstart: render the demo1 cube drop**. Step-by-step walk-through
   of how to regenerate the bundled MP4 from scratch on Apple Silicon —
   build both binaries, run sib_metal on demo1 (1250 steps, 25 ms),
   fetch OpenCL reference, render side-by-side via render_demo1_parity.py.

2. **Parity status and caveats vs OpenCL**. Promotes the previous
   single-paragraph status to a structured caveats list:
   - cube_extent_diff still fails — XPBD's one-sided density constraint
     doesn't replicate PCISPH's initial cube puff-up
   - long-timescale floor sliding diverges between substrates after
     ~100 ms (both physically valid for demo1's microscale)
   - differentiable backward gradients are biased due to per-step
     gradient clipping (essential for stability through chaotic dynamics)
   - membranes not yet ported to Metal substrate (demo2 unsupported)
   - no worm/c302 bridge in Metal substrate yet
   Each caveat links to the relevant kernel/file.

3. **Path to a native CUDA backend**. Details the strategic plan to
   mirror src/metal_diff/ (not the legacy owSolver abstract base) for
   the eventual NVIDIA-native substrate. Layout sketch, MSL→CUDA
   translation rules, 5-phase build plan (~2 weeks), reasoning for
   choosing CUDA over OpenCL on NVIDIA hardware.

Also updates src/cuda/README.md to align with the metal_diff-mirror
strategy: the previous "mirror PR #222 abstract base" plan has been
replaced with "mirror metal_diff" since metal_diff already worked out
the differentiable contract (paired forward/backward kernels, shared
metal_common.h, thin extern dispatch). The previous file's
phase-by-phase implementation plan stays — just the target architecture
changed.

Net: README +246 lines, src/cuda/README +106 lines, MP4 +880 KB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitHub markdown does not auto-render MP4 links inline, but it does
render GIFs. Added docs/cube_drop_demo1_25ms.gif (690 KB, 12 fps, 600 px
wide, palette-optimized via ffmpeg) so the cube-drop comparison plays
inline in the rendered README. The MP4 is preserved as the
higher-quality download below.

Generated via:
  ffmpeg -y -i docs/cube_drop_demo1_25ms.mp4 \
      -vf "fps=12,scale=600:-1:flags=lanczos,palettegen=stats_mode=full" \
      /tmp/palette.png
  ffmpeg -y -i docs/cube_drop_demo1_25ms.mp4 -i /tmp/palette.png \
      -lavfi "fps=12,scale=600:-1:flags=lanczos[x];[x][1:v]paletteuse=dither=bayer:bayer_scale=5" \
      docs/cube_drop_demo1_25ms.gif

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n service

The runner at ~/Documents/sibernetic-runner/ is slarsontech-private
infra and shouldn't be referenced from the public openworm/sibernetic
repo. Cleanup:

- scripts/cross_backend_regression.py: rewritten as a local-only
  comparator. Strips the hardcoded URL + API key, drops Cloud Run
  client logic (submit_cloud_run, poll loop, GCS result fetch). New
  CLI takes one or more --binary "NAME=PATH" args; first PASS run
  becomes the comparison baseline. Cube-stability evaluation logic
  unchanged.

- README.md: re-worded the OpenCL-reference fetch step in the
  "Quickstart: render the demo1 cube drop" section to point at a
  local Linux OpenCL run rather than the private runner. Performance
  table no longer mentions specific Cloud Run latencies.

- DEVELOPMENT_LOG.md: historical perf entries kept (the L4 numbers
  came from real measurements), but the specific service ("Cloud
  Run", "sibernetic-runner", "Cloud Run image") is genericized to
  "remote L4 GPU" / "remote runner".

- src/cuda/README.md: "Why not just use OpenCL on NVIDIA?" section
  reworded.

- src/metal_diff/sgd_tune.py: docstring reference updated.

Note: the API key `sibernetic-runner-2026` and runner URL were
previously committed to scripts/cross_backend_regression.py and
pushed to the public branch. They should be considered exposed and
the runner's API key rotated. A history-rewrite commit follows
separately to scrub the values from older commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds:
  docs/demo2_membranes_opencl.mp4    8.7 MB  full-res, 1920×1088, 30 fps,
                                             3.5 sec covering 200 ms of
                                             sim time on demo2 (4 732
                                             liquid particles dropping
                                             onto two side-by-side
                                             elastic sheets — left has
                                             membranes, right doesn't)
  docs/demo2_membranes_opencl.gif    1.3 MB  720-px-wide, 10 fps, palette-
                                             optimized inline preview

The membrane effect is plain: liquid pools on top of the left
membrane-clad sheet and passes through the right unmembraned sheet to
the floor. This is the canonical visualization of the membrane physics
that the Metal substrate doesn't yet model, and a useful reference for
when membrane support gets ported.

README: new "Preview: demo2 membrane permeability" subsection with the
inline GIF, sized parameters, and a forward-link to caveat #4
explaining why this works on OpenCL but not yet on Metal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of the membrane port (parity plan). Stops silently ignoring
the two membrane sections in Sibernetic configs:

  [membranes]        -> int32[N_membrane, 3] triangle vertex IDs,
                        remapped from global to active-buffer-local
                        indexing (matches the bonds-buffer convention).
                        Written to <scenario>_membranes.bin.
  [particleMemIndex] -> int32[n_elastic*7] flat list of incident
                        membrane indices per elastic particle, -1 for
                        unused slots. MAX_MEMBRANES_INCLUDING_SAME_PARTICLE
                        = 7 matches sphFluid.cl:50. Written to
                        <scenario>_pmem_index.bin.

info dict now exposes 'n_membranes', paths['membranes'],
paths['pmem_index']. Verified counts on demo1 (432 triangles), demo2
(2112), worm_alone_half_resolution (2838).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of the membrane port. Adds three Metal kernels mirroring the
OpenCL membrane mechanism in sphFluid.cl:1042-1322:

  clear_membrane_correction        — zero mem_corr accumulator
  accumulate_membrane_correction   — per-liquid: scan elastic
                                      neighbors within r0, project onto
                                      incident triangles, accumulate
                                      Ihmsen-weighted normals, write
                                      delta to mem_corr
  apply_membrane_correction        — pos += mem_corr

Plus two inlined helpers: calc_det_3x3 and project_to_plane (the
3×3 Cramer's rule projection from sphFluid.cl:1066-1105). The
projection's degenerate-triangle case uses |denom| < 1e-9 to skip
rather than the OpenCL .w == -1 sentinel — Metal is type-safe so we
need an out-param + bool return.

ε safeguards at the four flagged sites:
  - project_to_plane: |denom| < 1e-9 → skip
  - distance check: r2 < 1e-12 → skip (coincident particles)
  - normal length: n_len2 < 1e-12 → skip (already on plane)
  - weight sum: w_sum < 1e-9 → skip (matches sphFluid.cl:1277)

Active-buffer ordering convention: elastic [0, n_elastic), liquid
[n_elastic, n_active). No separate types buffer needed; the kernel
uses gid >= n_elastic to filter to liquid (matches OpenCL's
LIQUID_PARTICLE check).

Verified: shader library still compiles, existing test_xpbd.py PASSES.
Wiring into the pipeline (and FD-validating a backward) come in
Phase 3 / Phase 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
backward, SGD-tuned to OpenCL parity

Lands the membrane physics needed to run demo2 on the native Metal
substrate, plus three adjacent fixes that surfaced during the port:
boundary anchor springs (sheets stay aloft instead of crashing to
floor under gravity), correct rho_rest in sim-density units (density
constraint actually engages at sheet contact), and hard box-wall +
velocity clamp (impact-event ejection contained instead of permanent
escape).

What's new:

- M10 membranes: clear / accumulate / apply kernels in shaders.metal
  ported from sphFluid.cl:1042-1322, with hand-derived analytic
  backward through plane projection + unit-normal averaging + Ihmsen
  weighting. FD-validated to <7e-4 relative error in
  test_membrane_correction.py (single-triangle, shared-edge,
  multi-liquid). Multi-step backward chain wired into xpbd_full_bwd;
  test_xpbd_full_membrane.py passes K=1/2/3/5.

- M11 boundary anchor springs: load_config.decode_bonds() was
  silently dropping 670 elastic→boundary connections in demo2's
  [connection] table because their j index didn't map to active-buffer
  local. Added separate anchor record format + spring_anchor_force
  kernel + --anchor-k CLI; without these, sheets fall under gravity
  and crash to floor in 20 ms.

- M12 SPH pressure force kernel as backup; off by default since
  correct rho_rest makes the existing density constraint provide
  pressure-like behavior on its own.

- M13 box-wall + velocity clamp: impact-event ejection
  (~1000 OOB particles flying off in random xz directions) reduced
  to ~0. velocity_clamp at 1.0 m/s prevents CFL violations.

- New SGD harness sgd_demo2_permeability.py: finite-diff over
  dump_metal_trajectory CLI, loss = (y_mem-target)² + (y_por-target)²
  matching OpenCL t=16ms per-side liquid retention. Dropped loss
  562 → 220 (61%) over 13 iterations to K=2697, anchor_k=12177.

Adjacent fixes:

- dump_metal_trajectory.py: was zeroing initial vel_active; now
  reads from cfg['vel'][:, :3] (demo2 liquid starts at v_y=-0.027 m/s).

- dump_metal_trajectory.py rho_rest default: was 1000 (kg/m³, water);
  Metal density kernel uses h in particle units giving kg/sim_unit³,
  so default is now 8e-13 — the constraint only fires when liquid
  hits elastic neighbors (~10× rest density at sheet contact).

- ops_xpbd_step / ops_xpbd_full: membrane apply moved to AFTER
  update_velocities so it's a position-only soft constraint instead
  of a 41 750 m/s velocity injection.

- ops_xpbd_full xpbd_full_bwd: gradient-explosion diagnosis (not a
  NaN bug) — TBPTT=3 + log-space clip-norm=10 keeps gradients finite
  at Sibernetic-scale params; per-step amplification is ~400 000×.

Per-side liquid behavior at t=16ms (target = OpenCL):
  membrane side: y_mean=69 (target 56) — over-retains slightly
  porous side:   y_mean=13 (target 6)  — slightly less than full passthrough

README updated with new demo2 GIF + chronology of the porting work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion/one_sprig_test

The single-spring oscillation test (1 elastic on 1 spring anchored to a
boundary, stationary box) lives at configuration/test/one_sprig_test in
the historical layout. Sibernetic's `-f` flag only looks at
configuration/<name>, not configuration/test/<name>, so this top-level
copy makes it usable directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…muscle drive

The smallest elasticity scenario in the suite: one elastic particle on
one spring tied to a fixed anchor. Validates that the substrate's
spring/anchor machinery reproduces OpenCL period and amplitude exactly
(both backends read y = 18.606 at t = 2.000 ms with anchor_K = 555).

- shaders.metal: spring_anchor_force kernel extended with three new
  parameters (muscle_freq, muscle_amp, time_t) for time-varying anchor
  rest length: L_eff = L_base * (1 + amp * sin(2π * freq * t)).
- ops_xpbd_step.mm: per-step host-side time_t buffer with phase
  continuity carried across chunked invocations via time_offset_s.
- dump_metal_trajectory.py: --muscle-freq / --muscle-amp CLI args with
  cumulative_steps tracking; explicit '0' placeholder protocol when
  optional CLI slots between vel_clamp and muscle args are skipped.
- load_config.py: read 6 bare leading floats as box bounds when
  [simulation box] header is absent (one_sprig_test format).
- render_one_sprig.py: trajectory renderer with auto-detection of
  OpenCL (frame0 active+boundary, subsequent active-only) vs native
  Metal (active+boundary every frame) format.
- sgd_one_sprig.py: FD-SGD harness over the dump_metal_trajectory CLI;
  loss = weighted ((period - target_period)^2 + (half_amp - target_amp)^2)
  on log-space anchor_k. Confirms the kernel converges to anchor_K = 555.
- README.md: preview section paralleling demo1 / demo2 with side-by-side
  GIF.
Replace the three Preview blocks at the top of README with a single
2x2 combined GIF (sibernetic_demos_combined.gif) plus a one-line
description per demo and a pointer to the new docs/demos.md. Roll
the prior demo1-specific Quickstart into a single, demo-agnostic
"Quickstart: run a demo comparison" section that points readers to
the per-demo recipes in docs/demos.md.

- docs/demos.md (new): per-demo writeups for demo1, demo2,
  one_sprig_test, and worm_alone_half_resolution. Each includes
  parity numbers, the SGD-tuned Metal parameters, and a
  "Regenerate locally" block with the exact commands.
- docs/sibernetic_demos_combined.gif (new): 4-second 12 fps loop
  combining the four demo MP4s as a 2x2 grid (1000x1000), built
  with a two-pass palette pipeline.
- docs/worm_alone_opencl_vs_metal.mp4 (new): the 4-panel
  worm_alone comparison MP4 (top row iso, bottom row closeup,
  left OpenCL, right native Metal).
- src/metal_diff/render_worm.py (new): worm-config trajectory
  renderer with auto-format detection; supports iso, closeup,
  and cross views; CLI knobs for zoom, z-focus-frac,
  z-extent-frac, show-edges, focus-on-worm.
- src/metal_diff/sgd_worm.py (new): FD-SGD harness over the Metal
  trajectory CLI for worm configs, with per-frame elastic-only
  L2 loss against an OpenCL reference trajectory.
Worm immersed in 80 929-particle liquid bath — the next-up scale from
worm_alone (which sat on dry boundary). Worm shape preserved within
2 % over 25 ms, mean per-particle position drift ≤ 3.1 sim units by
end of trajectory (closer earlier: Δ = 0.5 at 1 ms, 0.6 at 5 ms,
0.8 at 10 ms). Buoyancy comes from XPBD's density-constraint solver
firing when local density exceeds rho_rest = 4e-13 (the natural water
density in our units, kg/sim_unit³).

Two settings dominate:

- rho_rest = 4e-13 (was 8e-13, above natural density — solver never
  fired). Below natural density makes the constraint persistently
  active.
- alpha_dens = 1e3 (was 1e-3, too soft to compete with the
  mass/rho² term in the XPBD denominator). 1e3 makes the
  per-iteration position correction a meaningful fraction of the
  density excess.

Substrate-level work (M14 — Ihmsen 2010 boundary handling):

- shaders.metal — new kernel boundary_position_correction. Sums
  weighted boundary normals from sorted_static neighbors, projects
  active particles along the resulting direction by w2_sum/w_sum.
  Dispatched both inside the inner XPBD loop (3× per step) and after
  update_velocities to keep velocity changes bounded.
- metal_common.{h,mm} — build_static_grid extended to optionally sort
  a parallel aux array (boundary normals) using the same permutation
  it applies to positions. Caller frees both.
- load_config.py — extracts boundary "velocities" (which are
  inward-pointing surface normals per Sibernetic config convention,
  sphFluid.cl:791) and writes them un-sorted; the C++ build_static_grid
  sorts them alongside positions in one pass.
- ops_xpbd_step.mm — new args bound_r0, bound_gain, path_bound_normals
  (argv[42..44]); seeds bufDens to rho_rest when use_pressure is on
  (was only firing when use_pair_forces was on, leaving pressure
  reading uninitialized memory and NaN-ing).

M14 is in place but is NOT what fixes worm_swim — the buoyancy fix
is the rho_rest/alpha_dens combination above. M14 retroactively
improves worm_alone (replaces its floor_y=2.25 hack with real Ihmsen
normals) and is the substrate-level groundwork for any scenario where
boundary repulsion matters.

Argv encoding fix in dump_metal_trajectory.py: vel_clamp + box_clamp
slot reservation now expands to also reserve when M14 args follow.

SGD harness extensions (sgd_worm.py): linear-space updates for
bounded params (floor_y), oscillation-detect-and-damp, alpha_dens
trainable.

Skill update (~/.claude/skills/sibernetic-demo-port/SKILL.md):
documents that build-what's-needed is the rule, not "ship with
caveats and ask permission to keep working".
Extended the implementation to support Metal alongside OpenCL.
- docs/worm_alone_opencl_vs_metal.gif: 1.4 MB, 12 fps, 600px wide.
  worm_alone has had an MP4 in docs/ since the original port but no
  embeddable preview; demos.md now displays it inline at the top of
  the section (matching demo1 / demo2 / one_sprig_test convention).
- docs/worm_swim_opencl_vs_metal.gif: 3.6 MB, 8 fps, 400px wide.
  worm_swim is bigger because the 80 929 swirling water particles
  produce huge inter-frame variation; even at the tightest dither and
  64 colors a smaller-than-this gif wasn't readable. Embedded in
  demos.md.
- docs/sibernetic_demos_combined.gif: rebuilt as a 5-panel mosaic
  (was 4). Top row: demo1 / demo2 / one_sprig_test. Bottom row:
  worm_alone / [white spacer] / worm_swim. Same 4-second loop. Also
  bumped alt text in README.md to mention all 5.
- .github/workflows/ci-build.yml: pip-install numpy on the macOS
  runner before the Native-Metal smoke test. The smoke test imports
  load_config which `import numpy as np`, but macos-latest's system
  python3 doesn't ship numpy by default — caused every push since the
  worm_alone port to fail this step. Use --break-system-packages
  because Python 3.13+ on the runner treats the system interpreter
  as PEP-668 externally-managed.
load_to_metal_buffers() looks up `configuration/<scenario>` relative to
cwd. The previous fix to the macOS smoke test (install numpy) revealed
this second issue: cd-ing into src/metal_diff/ before the python heredoc
broke the configuration/ lookup. Pre-locate the binary as $SIB_METAL,
keep cwd at repo root, and reference both via env.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants