diff --git a/.claude/commands/sweep-performance.md b/.claude/commands/sweep-performance.md new file mode 100644 index 00000000..2079f191 --- /dev/null +++ b/.claude/commands/sweep-performance.md @@ -0,0 +1,494 @@ +# Performance Sweep: Parallel Triage and Fix Workflow + +Audit xrspatial modules for performance bottlenecks, OOM risk under 30TB dask +workloads, and backend-specific anti-patterns. Dispatches parallel subagents +for fast triage, then generates a ralph-loop to benchmark and fix HIGH-severity +issues. + +Optional arguments: $ARGUMENTS +(e.g. `--top 5`, `--exclude slope,aspect`, `--only-io`, `--reset-state`) + +--- + +## Step 0 -- Determine mode and parse arguments + +Parse $ARGUMENTS for these flags (multiple may combine): + +| Flag | Effect | +|------|--------| +| `--top N` | Limit Phase 1 to the top N scored modules (default: all) | +| `--exclude mod1,mod2` | Remove named modules from scope | +| `--only-terrain` | Restrict to: slope, aspect, curvature, terrain, terrain_metrics, hillshade, sky_view_factor | +| `--only-focal` | Restrict to: focal, convolution, morphology, bilateral, edge_detection, glcm | +| `--only-hydro` | Restrict to: flood, cost_distance, geodesic, surface_distance, viewshed, erosion, diffusion | +| `--only-io` | Restrict to: geotiff, reproject, rasterize, polygonize | +| `--reset-state` | Delete `.claude/performance-sweep-state.json` and treat all modules as never-inspected | +| `--skip-phase1` | Skip triage; reuse last state file; go straight to ralph-loop generation for unresolved HIGH items | +| `--report-only` | Run Phase 1 triage but do not generate a ralph-loop command | +| `--size small` | Phase 2 benchmarks use 128x128 arrays | +| `--size large` | Phase 2 benchmarks use 2048x2048 arrays | +| `--high-only` | Only report HIGH severity findings in the triage output | + +If `--skip-phase1` is set, jump to Step 6 (ralph-loop generation). +Otherwise proceed to Step 1. 
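The flag table above can be sketched as a small parser. This is an illustrative sketch only — the command parses `$ARGUMENTS` in prose at run time, and the function name, option dict, and defaults here are assumptions, not part of the command:

```python
# Hypothetical sketch of the Step 0 flag semantics. Not part of the command
# file itself; names and defaults are illustrative.
import shlex

SIZES = {"small": 128, "large": 2048}  # --size values; default is 512


def parse_sweep_args(arguments: str) -> dict:
    opts = {"top": None, "exclude": set(), "only": None,
            "reset_state": False, "skip_phase1": False,
            "report_only": False, "high_only": False, "size": 512}
    tokens = shlex.split(arguments)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--top":
            i += 1
            opts["top"] = int(tokens[i])
        elif tok == "--exclude":
            i += 1
            opts["exclude"] |= set(tokens[i].split(","))
        elif tok == "--size":
            i += 1
            opts["size"] = SIZES[tokens[i]]
        elif tok.startswith("--only-"):
            opts["only"] = tok[len("--only-"):]  # e.g. "terrain", "io"
        else:
            # boolean flags: "--reset-state" -> opts["reset_state"] = True
            opts[tok.lstrip("-").replace("-", "_")] = True
        i += 1
    return opts
```

Note that multiple flags combine, e.g. `--only-io --top 2 --high-only` narrows scope, caps the ranking, and filters the report in one invocation.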
+
+## Step 1 -- Discover modules in scope
+
+Enumerate all candidate modules. For each, record its file path(s):
+
+**Single-file modules:** Every `.py` file directly under `xrspatial/`, excluding
+`__init__.py`, `_version.py`, `__main__.py`, `utils.py`, `accessor.py`,
+`preview.py`, `dataset_support.py`, `diagnostics.py`, `analytics.py`.
+
+**Subpackage modules:** The `geotiff/` and `reproject/` directories under
+`xrspatial/`. Treat each subpackage as a single audit unit. List all `.py`
+files within each (excluding `__init__.py`).
+
+Apply `--only-*` and `--exclude` filters from Step 0 to narrow the list.
+
+Store the filtered module list in memory (do NOT write intermediate files).
+
+## Step 2 -- Gather metadata and score each module
+
+For every module in scope, collect:
+
+| Field | How |
+|-------|-----|
+| **last_modified** | `git log -1 --format=%aI -- <path>` (for subpackages, use the most recent file) |
+| **total_commits** | `git log --oneline -- <path> \| wc -l` |
+| **loc** | `wc -l < <path>` (for subpackages, sum all files) |
+| **has_dask_backend** | grep the file(s) for `_run_dask`, `map_overlap`, `map_blocks` |
+| **has_cuda_backend** | grep the file(s) for `@cuda.jit`, `import cupy` |
+| **is_io_module** | module is geotiff or reproject |
+| **has_existing_bench** | a file matching the module name exists in `benchmarks/benchmarks/` |
+
+### Load inspection state
+
+Read `.claude/performance-sweep-state.json`. If it does not exist, treat every
+module as never-inspected. If `--reset-state` was set, delete the file first. 
+
+State file schema:
+
+```json
+{
+  "last_triage": "ISO-DATE",
+  "modules": {
+    "slope": {
+      "last_inspected": "ISO-DATE",
+      "oom_verdict": "SAFE",
+      "bottleneck": "compute-bound",
+      "high_count": 0,
+      "issue": null
+    }
+  }
+}
+```
+
+### Compute scores
+
+```
+days_since_inspected = (today - last_inspected).days  # 9999 if never
+days_since_modified = (today - last_modified).days
+
+score = (days_since_inspected * 3)
+      + (loc * 0.1)
+      + (total_commits * 0.5)
+      + (has_dask_backend * 200)
+      + (has_cuda_backend * 150)
+      + (is_io_module * 300)
+      - (days_since_modified * 0.2)
+      - (has_existing_bench * 100)
+```
+
+Sort modules by score descending. If `--top N` is set, keep only the top N.
+
+## Step 3 -- Dispatch parallel subagents for static triage
+
+For each module in the scored list, dispatch a subagent using the Agent tool.
+Launch ALL subagents in a single message (parallel dispatch). Each subagent
+receives the prompt below, with `MODULE_NAME` and `MODULE_FILES` substituted.
+
+**Subagent prompt template:**
+
+~~~
+You are auditing the xrspatial module "MODULE_NAME" for performance issues.
+
+Read these files: MODULE_FILES
+
+Perform ALL of the following analyses and return your findings as a single
+JSON object. Do NOT modify any files. This is read-only analysis.
+
+### 1. Dask Path Analysis
+
+Trace every dask code path (_run_dask, _run_dask_cupy, or any function that
+receives dask-backed DataArrays). 
Flag these patterns with severity: + +- HIGH: `.values` on a dask-backed DataArray or CuPy array (premature materialization) +- HIGH: `.compute()` inside a loop (materializes full graph each iteration) +- HIGH: `np.array()` or `np.asarray()` wrapping a dask or CuPy array +- MEDIUM: `da.stack()` without a following `.rechunk()` +- MEDIUM: `map_overlap` with depth >= chunk_size / 4 +- MEDIUM: Missing `boundary` argument in `map_overlap` +- MEDIUM: Same function called twice on same input without caching +- MEDIUM: Python `for` loop iterating over dask chunks (serializes the graph) + +If the module has NO dask code path, note "no dask backend" and skip. + +### 2. 30TB / 16GB OOM Verdict + +For each dask code path found in section 1: + +**Part A — Static trace:** Follow the code end-to-end. Answer: does peak +memory scale with total array size, or with chunk size? If any operation +forces full materialization, the verdict is WILL OOM. + +**Part B — Task graph simulation:** Write and run a Python script (in /tmp/ +with a unique name including "MODULE_NAME") that: + +```python +import dask.array as da +import xarray as xr +import json, sys + +arr = da.zeros((2560, 2560), chunks=(256, 256), dtype='float64') +raster = xr.DataArray(arr, dims=['y', 'x']) + +# Add coords if the function needs them (geodesic, slope with CRS, etc.) 
+# raster = raster.assign_coords(x=np.linspace(-180, 180, 2560),
+#                               y=np.linspace(-90, 90, 2560))
+
+try:
+    result = MODULE_FUNCTION(raster, **DEFAULT_ARGS)
+    graph = result.__dask_graph__()
+    task_count = len(graph)
+    tasks_per_chunk = task_count / 100.0  # (2560/256)**2 = 100 chunks
+
+    # Fan-in check: the maximum number of upstream tasks any single task
+    # depends on. High fan-in means the scheduler must hold many chunks
+    # in memory at once to produce one output.
+    from dask.core import get_dependencies
+    dsk = dict(graph)
+    max_fan_in = max(
+        (len(get_dependencies(dsk, key)) for key in dsk), default=0)
+
+    print(json.dumps({
+        "success": True,
+        "task_count": task_count,
+        "tasks_per_chunk": round(tasks_per_chunk, 2),
+        "max_fan_in": max_fan_in,
+        "extrapolation_30tb": "~{} tasks at 57M chunks".format(
+            int(tasks_per_chunk * 57_000_000))
+    }))
+except Exception as e:
+    print(json.dumps({"success": False, "error": str(e)}))
+```
+
+Adapt the function call and imports for the specific module. Run the script
+and capture its JSON output. If it errors, record the error and rely on
+Part A alone.
+
+**Verdict:** One of:
+- `SAFE` — memory bounded by chunk size, graph scales linearly
+- `RISKY` — bounded but tight (e.g. large overlap depth, 3D intermediates)
+- `WILL OOM` — forces full materialization or unbounded memory growth
+
+### 3. GPU Transfer Analysis
+
+Scan for CuPy/CUDA code paths. Flag:
+
+- HIGH: `.data.get()` followed by CuPy operations (GPU-CPU-GPU round-trip)
+- HIGH: `cupy.asarray()` inside a loop (repeated CPU-GPU transfers)
+- MEDIUM: Mixing NumPy and CuPy ops in same function without clear reason
+- MEDIUM: Register pressure — count float64 local variables in `@cuda.jit`
+  kernels; flag if >20
+- MEDIUM: Thread blocks >16x16 on kernels with >20 float64 locals
+
+If the module has NO GPU code path, note "no GPU backend" and skip.
+
+### 4. 
Memory Allocation Patterns + +- MEDIUM: Unnecessary `.copy()` on arrays never mutated downstream +- MEDIUM: Large temporary arrays that could be fused into the kernel +- LOW: `np.zeros_like()` + fill loop where `np.empty()` would suffice + +### 5. Numba Anti-Patterns + +- MEDIUM: Missing `@ngjit` on nested for-loops over `.data` arrays +- MEDIUM: `@jit` without `nopython=True` (object-mode fallback risk) +- LOW: Type instability — initializing with int then assigning float +- LOW: Column-major iteration on row-major arrays (inner loop should be last axis) + +### 6. Bottleneck Classification + +Based on your analysis, classify the module as ONE of: +- `IO-bound` — dominated by disk reads/writes or serialization +- `memory-bound` — peak allocation is the limiting factor +- `compute-bound` — CPU/GPU time dominates, memory is fine +- `graph-bound` — dask task graph overhead dominates + +### Output Format + +Return EXACTLY this JSON structure (no extra text before or after): + +```json +{ + "module": "MODULE_NAME", + "files_read": ["list of files you read"], + "findings": [ + { + "severity": "HIGH|MEDIUM|LOW", + "category": "dask_materialization|dask_chunking|gpu_transfer|register_pressure|memory_allocation|numba_antipattern", + "file": "filename.py", + "line": 123, + "description": "what the issue is", + "fix": "how to fix it", + "backends_affected": ["dask+numpy", "dask+cupy", "cupy", "numpy"] + } + ], + "oom_verdict": { + "dask_numpy": "SAFE|RISKY|WILL OOM", + "dask_cupy": "SAFE|RISKY|WILL OOM", + "reasoning": "one-sentence explanation", + "estimated_peak_per_chunk_mb": 0.5, + "task_count": 3721, + "tasks_per_chunk": 37.21, + "graph_simulation_ran": true + }, + "bottleneck": "compute-bound|memory-bound|IO-bound|graph-bound", + "bottleneck_reasoning": "one-sentence explanation" +} +``` + +IMPORTANT: Only flag patterns that are ACTUALLY present in the code. Do not +report hypothetical issues. False positives are worse than missed issues. 
+If a pattern like `.values` is used on a known-numpy-only code path, do not +flag it. +~~~ + +Wait for all subagents to return before proceeding to Step 4. + +## Step 4 -- Merge results and print the triage report + +Parse the JSON returned by each subagent. If a subagent returned malformed +output, record the module as "audit failed" with a note. + +### 4a. Print the Module Risk Ranking Table + +Sort modules by score descending. Print: + +``` +## Performance Sweep — Static Triage Report + +### Module Risk Ranking +| Rank | Module | Score | OOM Verdict | Bottleneck | HIGH | MED | LOW | +|------|-----------------|--------|-----------------|---------------|------|-----|-----| +| 1 | geotiff | 31200 | WILL OOM (d+np) | IO-bound | 3 | 1 | 0 | +| 2 | viewshed | 30050 | RISKY (d+np) | memory-bound | 2 | 2 | 1 | +| ... | ... | ... | ... | ... | ... | ... | ... | +``` + +If `--high-only` is set, only count HIGH findings and omit modules with zero HIGH. + +### 4b. Print the 30TB / 16GB Verdict Summary + +Group modules by OOM verdict: + +``` +### 30TB on Disk / 16GB RAM — Out-of-Memory Analysis + +#### WILL OOM (fix required) +- **module_name**: reasoning from subagent + +#### RISKY (bounded but tight) +- **module_name**: reasoning from subagent + +#### SAFE (memory bounded by chunk size) +- module_name, module_name, module_name, ... +``` + +### 4c. Print Detailed Findings + +For each module that has findings, print a severity-grouped table: + +``` +### module_name (bottleneck: compute-bound, OOM: SAFE) + +| # | Severity | File:Line | Category | Description | Fix | +|---|----------|----------------|-------------------------|------------------------------|-------------------------------| +| 1 | HIGH | slope.py:142 | dask_materialization | .values on dask input | Use .data or stay lazy | +| 2 | MEDIUM | slope.py:88 | dask_chunking | map_overlap depth too large | Reduce depth or warn users | +``` + +### 4d. 
Print Actionable Rockout Commands
+
+For each HIGH-severity finding, print a ready-to-paste `/rockout` command:
+
+```
+### Ready-to-Run Fixes (HIGH severity only)
+
+1. **geotiff** — eager .values materialization (WILL OOM)
+   /rockout "Fix eager .values materialization in geotiff reader.
+   The dask read path at reader.py:87 calls .values which forces
+   the full array into memory. For 30TB inputs this will OOM on
+   a 16GB machine. Must stay lazy through the entire read path."
+
+2. **cost_distance** — iterative solver unbounded memory (WILL OOM)
+   /rockout "Fix cost_distance iterative solver to work within
+   bounded memory. Currently materializes the full distance matrix
+   each iteration. Must use chunked iteration for 30TB dask inputs."
+```
+
+Construct each `/rockout` command from the finding's description and fix fields.
+Include the OOM verdict and bottleneck classification in the prompt text so
+rockout has full context.
+
+## Step 5 -- Update state file
+
+Write `.claude/performance-sweep-state.json` with the triage results:
+
+```json
+{
+  "last_triage": "<ISO-DATE>",
+  "modules": {
+    "<module>": {
+      "last_inspected": "<ISO-DATE>",
+      "oom_verdict": "<SAFE|RISKY|WILL OOM>",
+      "bottleneck": "<classification>",
+      "high_count": <count>,
+      "issue": null
+    }
+  }
+}
+```
+
+If the file already exists, merge — update entries for modules that were
+just audited, keep entries for modules not in this run's scope.
+
+If `--report-only` is set, stop here. Do not proceed to Step 6.
+
+## Step 6 -- Generate the ralph-loop command
+
+Collect all modules from Step 4 (or from the state file if `--skip-phase1`)
+that have at least one HIGH-severity finding and no `issue` recorded in the
+state file (i.e. not yet fixed).
+
+Sort them by: WILL OOM first, then RISKY, then by HIGH count descending.
+
+Determine the benchmark array size from arguments:
+- `--size small` → 128x128
+- `--size large` → 2048x2048
+- default → 512x512
+
+### 6a. 
Print the ranked target list
+
+```
+### Phase 2 Targets (HIGH severity, unfixed)
+| # | Module        | HIGH Count | OOM Verdict | Bottleneck   |
+|---|---------------|------------|-------------|--------------|
+| 1 | geotiff       | 3          | WILL OOM    | IO-bound     |
+| 2 | cost_distance | 1          | WILL OOM    | memory-bound |
+| 3 | viewshed      | 2          | RISKY       | memory-bound |
+```
+
+If no modules qualify, print:
+"No HIGH-severity findings to fix. Run `/sweep-performance` without
+`--skip-phase1` to refresh the triage."
+Then stop.
+
+### 6b. Print the ralph-loop command
+
+Using the target list, generate and print:
+
+````
+/ralph-loop "Performance sweep Phase 2: benchmark and fix HIGH-severity findings.
+
+**Target modules in priority order:**
+1. <module> (<HIGH count> HIGH findings, <OOM verdict>) -- <bottleneck>
+2. ...
+...
+
+**For each module, in order:**
+
+1. Write a benchmark script at /tmp/perf_sweep_bench_<module>.py that:
+   - Imports the module's public functions
+   - Creates a test array (<size>x<size>, float64)
+   - For EACH available backend (numpy, dask+numpy; cupy and dask+cupy only if available):
+     a. Wrap the array in the appropriate DataArray type
+     b. Measure wall time: timeit.repeat(number=1, repeat=3), take median
+     c. Measure Python memory: tracemalloc.start() / tracemalloc.get_traced_memory()[1] for peak
+     d. Measure process memory: resource.getrusage(RUSAGE_SELF).ru_maxrss before and after
+     e. For CuPy backends: cupy.get_default_memory_pool().used_bytes() before and after
+   - Print results as JSON to stdout
+
+2. Run the benchmark script and capture results.
+
+3. Confirm the HIGH finding from Phase 1:
+   - If the dask backend uses significantly more memory than expected for
+     the chunk size, or wall time shows a materialization stall: CONFIRMED.
+   - If the benchmark shows no anomaly: downgrade to MEDIUM in state file,
+     print 'False positive — skipping' and move to the next module.
+
+4. If confirmed: run /rockout to fix the issue end-to-end (issue, worktree,
+   implementation, tests, docs). 
Include the benchmark numbers in the
+   issue body for context.
+
+5. After rockout completes: rerun the same benchmark script. Print a
+   before/after comparison:
+   | Backend    | Metric      | Before | After  | Ratio | Verdict    |
+   |------------|-------------|--------|--------|-------|------------|
+   | numpy      | wall_ms     | 45.2   | 12.1   | 0.27x | IMPROVED   |
+   | dask+numpy | peak_rss_mb | 892    | 34     | 0.04x | IMPROVED   |
+   Thresholds: IMPROVED < 0.8x, REGRESSION > 1.2x, else UNCHANGED.
+
+6. Update .claude/performance-sweep-state.json with the issue number.
+
+7. Output ITERATION DONE
+
+If all targets have been addressed or confirmed as false positives:
+ALL PERFORMANCE ISSUES FIXED." --max-iterations <N> --completion-promise "ALL PERFORMANCE ISSUES FIXED"
+````
+
+Set `--max-iterations` to the number of target modules plus 2 (buffer for
+retries).
+
+### 6c. Print reminder text
+
+```
+Phase 1 triage complete. To proceed with fixes:
+  Copy the ralph-loop command above and paste it.
+
+Other options:
+  Fix one manually: copy any /rockout command from the report above
+  Rerun triage only: /sweep-performance --report-only
+  Skip Phase 1: /sweep-performance --skip-phase1 (reuses last triage)
+  Reset all tracking: /sweep-performance --reset-state
+```
+
+---
+
+## General Rules
+
+- Phase 1 subagents do NOT modify any source, test, or benchmark files.
+  Read-only analysis only.
+- Phase 2 ralph-loop modifies code only through `/rockout`.
+- Temporary benchmark scripts and graph simulation scripts go in `/tmp/`
+  with unique names including the module name (e.g. `/tmp/perf_sweep_bench_slope.py`,
+  `/tmp/perf_sweep_graph_slope.py`). Clean them up after capturing results.
+- Only flag patterns that are ACTUALLY present in the code. Do not report
+  hypothetical issues or patterns that "could" occur.
+- Include the exact file path and line number for every finding so the user
+  can navigate directly to the issue.
+- False positives are worse than missed issues. 
If you are not confident a + pattern is actually harmful in context (e.g. `.values` used intentionally + on a known-numpy array), do not flag it. +- The 30TB simulation constructs the dask task graph only; it NEVER calls + `.compute()`. +- State file (`.claude/performance-sweep-state.json`) is gitignored by + convention — do not add it to git. +- If $ARGUMENTS is empty, use defaults: audit all modules, benchmark at + 512x512, generate ralph-loop for HIGH items. +- For subpackage modules (geotiff, reproject), the subagent should read ALL + `.py` files in the subpackage directory, not just `__init__.py`. +- When generating `/rockout` commands, include the OOM verdict, bottleneck + classification, and affected backends in the prompt text so rockout has + full performance context. diff --git a/docs/superpowers/plans/2026-03-31-sweep-performance.md b/docs/superpowers/plans/2026-03-31-sweep-performance.md new file mode 100644 index 00000000..8615a41e --- /dev/null +++ b/docs/superpowers/plans/2026-03-31-sweep-performance.md @@ -0,0 +1,743 @@ +# Sweep-Performance Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Create a `/sweep-performance` slash command that audits all xrspatial modules for performance bottlenecks, OOM risk under 30TB dask workloads, and backend anti-patterns using parallel subagents, then generates a ralph-loop to fix HIGH-severity issues. + +**Architecture:** Single command file (`.claude/commands/sweep-performance.md`) containing all instructions for both phases. Phase 1 dispatches parallel subagents via the Agent tool for static analysis + 30TB graph simulation. Phase 2 generates a `/ralph-loop` command targeting HIGH-severity modules for real benchmarks and `/rockout` fixes. State persisted in `.claude/performance-sweep-state.json`. 
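The state-merge rule described above (entries for modules audited in this run replace their old entries; modules outside the run's scope keep their previous state) can be sketched as follows. This is an illustrative sketch only — `merge_state` and its signature are not part of the plan; the command performs this merge directly when it rewrites the state file:

```python
# Hypothetical sketch of the state-file merge rule. Audited modules
# overwrite their old entries; everything else is preserved untouched.
import datetime
import json
from pathlib import Path


def merge_state(path: Path, audited: dict) -> dict:
    state = {"last_triage": None, "modules": {}}
    if path.exists():
        state = json.loads(path.read_text())
    state["last_triage"] = datetime.date.today().isoformat()
    # dict.update gives the just-audited modules precedence while keeping
    # entries for modules that were not in this run's scope
    state.setdefault("modules", {}).update(audited)
    path.write_text(json.dumps(state, indent=2))
    return state
```

This is also why `--skip-phase1` works: the unresolved HIGH entries from the last triage survive every subsequent partial run.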
+ +**Tech Stack:** Claude Code slash commands (markdown), Agent tool for subagent dispatch, Bash for git metadata and benchmark scripts, dask for graph simulation, tracemalloc/resource/cupy for memory measurement. + +--- + +## File Structure + +| File | Purpose | +|------|---------| +| Create: `.claude/commands/sweep-performance.md` | The slash command — all Phase 1 and Phase 2 logic | +| Create: `.claude/performance-sweep-state.json` | Runtime state file (created by the command at execution time, not committed) | + +This is a single-file deliverable. The command file contains all the instructions that Claude follows when `/sweep-performance` is invoked. No Python code, no library files — just a well-structured prompt document, same pattern as `accuracy-sweep.md`. + +--- + +### Task 1: Scaffold the Command Header and Argument Parsing + +**Files:** +- Create: `.claude/commands/sweep-performance.md` + +- [ ] **Step 1: Create the command file with title, description, and argument parsing** + +```markdown +# Performance Sweep: Parallel Triage and Fix Workflow + +Audit xrspatial modules for performance bottlenecks, OOM risk under 30TB dask +workloads, and backend-specific anti-patterns. Dispatches parallel subagents +for fast triage, then generates a ralph-loop to benchmark and fix HIGH-severity +issues. + +Optional arguments: $ARGUMENTS +(e.g. 
`--top 5`, `--exclude slope,aspect`, `--only-io`, `--reset-state`) + +--- + +## Step 0 -- Determine mode and parse arguments + +Parse $ARGUMENTS for these flags (multiple may combine): + +| Flag | Effect | +|------|--------| +| `--top N` | Limit Phase 1 to the top N scored modules (default: all) | +| `--exclude mod1,mod2` | Remove named modules from scope | +| `--only-terrain` | Restrict to: slope, aspect, curvature, terrain, terrain_metrics, hillshade, sky_view_factor | +| `--only-focal` | Restrict to: focal, convolution, morphology, bilateral, edge_detection, glcm | +| `--only-hydro` | Restrict to: flood, cost_distance, geodesic, surface_distance, viewshed, erosion, diffusion | +| `--only-io` | Restrict to: geotiff, reproject, rasterize, polygonize | +| `--reset-state` | Delete `.claude/performance-sweep-state.json` and treat all modules as never-inspected | +| `--skip-phase1` | Skip triage; reuse last state file; go straight to ralph-loop generation for unresolved HIGH items | +| `--report-only` | Run Phase 1 triage but do not generate a ralph-loop command | +| `--size small` | Phase 2 benchmarks use 128x128 arrays | +| `--size large` | Phase 2 benchmarks use 2048x2048 arrays | +| `--high-only` | Only report HIGH severity findings in the triage output | + +If `--skip-phase1` is set, jump to Step 6 (ralph-loop generation). +Otherwise proceed to Step 1. +``` + +- [ ] **Step 2: Verify the file was created correctly** + +Run: `head -40 .claude/commands/sweep-performance.md` +Expected: The title, description, and Step 0 argument table are present. 
+ +- [ ] **Step 3: Commit** + +```bash +git add .claude/commands/sweep-performance.md +git commit -m "Add sweep-performance command scaffold with argument parsing" +``` + +--- + +### Task 2: Module Discovery and Scoring (Step 1-2) + +**Files:** +- Modify: `.claude/commands/sweep-performance.md` + +- [ ] **Step 1: Append Step 1 (module discovery) to the command file** + +Append the following to `.claude/commands/sweep-performance.md`: + +```markdown +## Step 1 -- Discover modules in scope + +Enumerate all candidate modules. For each, record its file path(s): + +**Single-file modules:** Every `.py` file directly under `xrspatial/`, excluding +`__init__.py`, `_version.py`, `__main__.py`, `utils.py`, `accessor.py`, +`preview.py`, `dataset_support.py`, `diagnostics.py`, `analytics.py`. + +**Subpackage modules:** The `geotiff/` and `reproject/` directories under +`xrspatial/`. Treat each subpackage as a single audit unit. List all `.py` +files within each (excluding `__init__.py`). + +Apply `--only-*` and `--exclude` filters from Step 0 to narrow the list. + +Store the filtered module list in memory (do NOT write intermediate files). 
+```
+
+- [ ] **Step 2: Append Step 2 (git metadata and scoring) to the command file**
+
+Append the following:
+
+```markdown
+## Step 2 -- Gather metadata and score each module
+
+For every module in scope, collect:
+
+| Field | How |
+|-------|-----|
+| **last_modified** | `git log -1 --format=%aI -- <path>` (for subpackages, use the most recent file) |
+| **total_commits** | `git log --oneline -- <path> \| wc -l` |
+| **loc** | `wc -l < <path>` (for subpackages, sum all files) |
+| **has_dask_backend** | grep the file(s) for `_run_dask`, `map_overlap`, `map_blocks` |
+| **has_cuda_backend** | grep the file(s) for `@cuda.jit`, `import cupy` |
+| **is_io_module** | module is geotiff or reproject |
+| **has_existing_bench** | a file matching the module name exists in `benchmarks/benchmarks/` |
+
+### Load inspection state
+
+Read `.claude/performance-sweep-state.json`. If it does not exist, treat every
+module as never-inspected. If `--reset-state` was set, delete the file first.
+
+State file schema:
+
+~~~json
+{
+  "last_triage": "ISO-DATE",
+  "modules": {
+    "slope": {
+      "last_inspected": "ISO-DATE",
+      "oom_verdict": "SAFE",
+      "bottleneck": "compute-bound",
+      "high_count": 0,
+      "issue": null
+    }
+  }
+}
+~~~
+
+### Compute scores
+
+~~~
+days_since_inspected = (today - last_inspected).days  # 9999 if never
+days_since_modified = (today - last_modified).days
+
+score = (days_since_inspected * 3)
+      + (loc * 0.1)
+      + (total_commits * 0.5)
+      + (has_dask_backend * 200)
+      + (has_cuda_backend * 150)
+      + (is_io_module * 300)
+      - (days_since_modified * 0.2)
+      - (has_existing_bench * 100)
+~~~
+
+Sort modules by score descending. If `--top N` is set, keep only the top N. 
+``` + +- [ ] **Step 3: Verify the appended steps read correctly** + +Run: `grep -c "^## Step" .claude/commands/sweep-performance.md` +Expected: `3` (Step 0, Step 1, Step 2) + +- [ ] **Step 4: Commit** + +```bash +git add .claude/commands/sweep-performance.md +git commit -m "Add module discovery and scoring to sweep-performance" +``` + +--- + +### Task 3: Phase 1 Subagent Dispatch (Step 3) + +**Files:** +- Modify: `.claude/commands/sweep-performance.md` + +- [ ] **Step 1: Append Step 3 (subagent dispatch and analysis instructions)** + +Append the following to `.claude/commands/sweep-performance.md`: + +````markdown +## Step 3 -- Dispatch parallel subagents for static triage + +For each module in the scored list, dispatch a subagent using the Agent tool. +Launch ALL subagents in a single message (parallel dispatch). Each subagent +receives the prompt below, with `MODULE_NAME` and `MODULE_FILES` substituted. + +**Subagent prompt template:** + +``` +You are auditing the xrspatial module "MODULE_NAME" for performance issues. + +Read these files: MODULE_FILES + +Perform ALL of the following analyses and return your findings as a single +JSON object. Do NOT modify any files. This is read-only analysis. + +## 1. Dask Path Analysis + +Trace every dask code path (_run_dask, _run_dask_cupy, or any function that +receives dask-backed DataArrays). 
Flag these patterns with severity: + +- HIGH: `.values` on a dask-backed DataArray or CuPy array (premature materialization) +- HIGH: `.compute()` inside a loop (materializes full graph each iteration) +- HIGH: `np.array()` or `np.asarray()` wrapping a dask or CuPy array +- MEDIUM: `da.stack()` without a following `.rechunk()` +- MEDIUM: `map_overlap` with depth >= chunk_size / 4 +- MEDIUM: Missing `boundary` argument in `map_overlap` +- MEDIUM: Same function called twice on same input without caching +- MEDIUM: Python `for` loop iterating over dask chunks (serializes the graph) + +If the module has NO dask code path, note "no dask backend" and skip. + +## 2. 30TB / 16GB OOM Verdict + +For each dask code path found in section 1: + +**Part A — Static trace:** Follow the code end-to-end. Answer: does peak +memory scale with total array size, or with chunk size? If any operation +forces full materialization, the verdict is WILL OOM. + +**Part B — Task graph simulation:** Write and run a Python script (in /tmp/ +with a unique name including "MODULE_NAME") that: + +```python +import dask.array as da +import xarray as xr +import json, sys + +arr = da.zeros((2560, 2560), chunks=(256, 256), dtype='float64') +raster = xr.DataArray(arr, dims=['y', 'x']) + +# Add coords if the function needs them (geodesic, slope with CRS, etc.) 
+# raster = raster.assign_coords(x=np.linspace(-180, 180, 2560),
+#                               y=np.linspace(-90, 90, 2560))
+
+try:
+    result = MODULE_FUNCTION(raster, **DEFAULT_ARGS)
+    graph = result.__dask_graph__()
+    task_count = len(graph)
+    tasks_per_chunk = task_count / 100.0  # (2560/256)**2 = 100 chunks
+
+    # Fan-in check: the maximum number of upstream tasks any single task
+    # depends on. High fan-in means the scheduler must hold many chunks
+    # in memory at once to produce one output.
+    from dask.core import get_dependencies
+    dsk = dict(graph)
+    max_fan_in = max(
+        (len(get_dependencies(dsk, key)) for key in dsk), default=0)
+
+    print(json.dumps({
+        "success": True,
+        "task_count": task_count,
+        "tasks_per_chunk": round(tasks_per_chunk, 2),
+        "max_fan_in": max_fan_in,
+        "extrapolation_30tb": "~{} tasks at 57M chunks".format(
+            int(tasks_per_chunk * 57_000_000))
+    }))
+except Exception as e:
+    print(json.dumps({"success": False, "error": str(e)}))
+```
+
+Adapt the function call and imports for the specific module. Run the script
+and capture its JSON output. If it errors, record the error and rely on
+Part A alone.
+
+**Verdict:** One of:
+- `SAFE` — memory bounded by chunk size, graph scales linearly
+- `RISKY` — bounded but tight (e.g. large overlap depth, 3D intermediates)
+- `WILL OOM` — forces full materialization or unbounded memory growth
+
+## 3. GPU Transfer Analysis
+
+Scan for CuPy/CUDA code paths. Flag:
+
+- HIGH: `.data.get()` followed by CuPy operations (GPU-CPU-GPU round-trip)
+- HIGH: `cupy.asarray()` inside a loop (repeated CPU-GPU transfers)
+- MEDIUM: Mixing NumPy and CuPy ops in same function without clear reason
+- MEDIUM: Register pressure — count float64 local variables in `@cuda.jit`
+  kernels; flag if >20
+- MEDIUM: Thread blocks >16x16 on kernels with >20 float64 locals
+
+If the module has NO GPU code path, note "no GPU backend" and skip.
+
+## 4. 
Memory Allocation Patterns + +- MEDIUM: Unnecessary `.copy()` on arrays never mutated downstream +- MEDIUM: Large temporary arrays that could be fused into the kernel +- LOW: `np.zeros_like()` + fill loop where `np.empty()` would suffice + +## 5. Numba Anti-Patterns + +- MEDIUM: Missing `@ngjit` on nested for-loops over `.data` arrays +- MEDIUM: `@jit` without `nopython=True` (object-mode fallback risk) +- LOW: Type instability — initializing with int then assigning float +- LOW: Column-major iteration on row-major arrays (inner loop should be last axis) + +## 6. Bottleneck Classification + +Based on your analysis, classify the module as ONE of: +- `IO-bound` — dominated by disk reads/writes or serialization +- `memory-bound` — peak allocation is the limiting factor +- `compute-bound` — CPU/GPU time dominates, memory is fine +- `graph-bound` — dask task graph overhead dominates + +## Output Format + +Return EXACTLY this JSON structure (no extra text before or after): + +```json +{ + "module": "MODULE_NAME", + "files_read": ["list of files you read"], + "findings": [ + { + "severity": "HIGH|MEDIUM|LOW", + "category": "dask_materialization|dask_chunking|gpu_transfer|register_pressure|memory_allocation|numba_antipattern", + "file": "filename.py", + "line": 123, + "description": "what the issue is", + "fix": "how to fix it", + "backends_affected": ["dask+numpy", "dask+cupy", "cupy", "numpy"] + } + ], + "oom_verdict": { + "dask_numpy": "SAFE|RISKY|WILL OOM", + "dask_cupy": "SAFE|RISKY|WILL OOM", + "reasoning": "one-sentence explanation", + "estimated_peak_per_chunk_mb": 0.5, + "task_count": 3721, + "tasks_per_chunk": 37.21, + "graph_simulation_ran": true + }, + "bottleneck": "compute-bound|memory-bound|IO-bound|graph-bound", + "bottleneck_reasoning": "one-sentence explanation" +} +``` + +IMPORTANT: Only flag patterns that are ACTUALLY present in the code. Do not +report hypothetical issues. False positives are worse than missed issues. 
+If a pattern like `.values` is used on a known-numpy-only code path, do not +flag it. +``` + +Wait for all subagents to return before proceeding to Step 4. +```` + +- [ ] **Step 2: Verify the subagent prompt is well-formed** + +Run: `grep -c "## [0-9]" .claude/commands/sweep-performance.md` +Expected: At least 6 (the six analysis sections inside the subagent prompt) + +- [ ] **Step 3: Commit** + +```bash +git add .claude/commands/sweep-performance.md +git commit -m "Add Phase 1 subagent dispatch and analysis template" +``` + +--- + +### Task 4: Phase 1 Report Merging and State Update (Steps 4-5) + +**Files:** +- Modify: `.claude/commands/sweep-performance.md` + +- [ ] **Step 1: Append Step 4 (merge subagent results into report)** + +Append the following to `.claude/commands/sweep-performance.md`: + +````markdown +## Step 4 -- Merge results and print the triage report + +Parse the JSON returned by each subagent. If a subagent returned malformed +output, record the module as "audit failed" with a note. + +### 4a. Print the Module Risk Ranking Table + +Sort modules by score descending. Print: + +``` +## Performance Sweep — Static Triage Report + +### Module Risk Ranking +| Rank | Module | Score | OOM Verdict | Bottleneck | HIGH | MED | LOW | +|------|-----------------|--------|-----------------|---------------|------|-----|-----| +| 1 | geotiff | 31200 | WILL OOM (d+np) | IO-bound | 3 | 1 | 0 | +| 2 | viewshed | 30050 | RISKY (d+np) | memory-bound | 2 | 2 | 1 | +| ... | ... | ... | ... | ... | ... | ... | ... | +``` + +If `--high-only` is set, only count HIGH findings and omit modules with zero HIGH. + +### 4b. 
Print the 30TB / 16GB Verdict Summary + +Group modules by OOM verdict: + +``` +### 30TB on Disk / 16GB RAM — Out-of-Memory Analysis + +#### WILL OOM (fix required) +- **module_name**: reasoning from subagent + +#### RISKY (bounded but tight) +- **module_name**: reasoning from subagent + +#### SAFE (memory bounded by chunk size) +- module_name, module_name, module_name, ... +``` + +### 4c. Print Detailed Findings + +For each module that has findings, print a severity-grouped table: + +``` +### module_name (bottleneck: compute-bound, OOM: SAFE) + +| # | Severity | File:Line | Category | Description | Fix | +|---|----------|----------------|-------------------------|------------------------------|-------------------------------| +| 1 | HIGH | slope.py:142 | dask_materialization | .values on dask input | Use .data or stay lazy | +| 2 | MEDIUM | slope.py:88 | dask_chunking | map_overlap depth too large | Reduce depth or warn users | +``` + +### 4d. Print Actionable Rockout Commands + +For each HIGH-severity finding, print a ready-to-paste `/rockout` command: + +``` +### Ready-to-Run Fixes (HIGH severity only) + +1. **geotiff** — eager .values materialization (WILL OOM) + /rockout "Fix eager .values materialization in geotiff reader. + The dask read path at reader.py:87 calls .values which forces + the full array into memory. For 30TB inputs this will OOM on + a 16GB machine. Must stay lazy through the entire read path." + +2. **cost_distance** — iterative solver unbounded memory (WILL OOM) + /rockout "Fix cost_distance iterative solver to work within + bounded memory. Currently materializes the full distance matrix + each iteration. Must use chunked iteration for 30TB dask inputs." +``` + +Construct each `/rockout` command from the finding's description and fix fields. +Include the OOM verdict and bottleneck classification in the prompt text so +rockout has full context. 
+
+````
+
+- [ ] **Step 2: Append Step 5 (state file update)**
+
+Append the following:
+
+````markdown
+## Step 5 -- Update state file
+
+Write `.claude/performance-sweep-state.json` with the triage results:
+
+```json
+{
+  "last_triage": "<ISO timestamp>",
+  "modules": {
+    "<module_name>": {
+      "last_inspected": "<ISO timestamp>",
+      "oom_verdict": "<SAFE|RISKY|WILL OOM>",
+      "bottleneck": "<classification>",
+      "high_count": <integer>,
+      "issue": null
+    }
+  }
+}
+```
+
+If the file already exists, merge — update entries for modules that were
+just audited, keep entries for modules not in this run's scope.
+
+If `--report-only` is set, stop here. Do not proceed to Step 6.
+````
+
+- [ ] **Step 3: Verify both steps appended**
+
+Run: `grep -c "^## Step" .claude/commands/sweep-performance.md`
+Expected: `6` (Steps 0 through 5)
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add .claude/commands/sweep-performance.md
+git commit -m "Add Phase 1 report merging and state update"
+```
+
+---
+
+### Task 5: Phase 2 Ralph-Loop Generation (Step 6)
+
+**Files:**
+- Modify: `.claude/commands/sweep-performance.md`
+
+- [ ] **Step 1: Append Step 6 (ralph-loop generation)**
+
+Append the following to `.claude/commands/sweep-performance.md`:
+
+````markdown
+## Step 6 -- Generate the ralph-loop command
+
+Collect all modules from Step 4 (or from the state file if `--skip-phase1`)
+that have at least one HIGH-severity finding and no `issue` recorded in the
+state file (i.e. not yet fixed).
+
+Sort them by: WILL OOM first, then RISKY, then by HIGH count descending.
+
+Determine the benchmark array size from arguments:
+- `--size small` → 128x128
+- `--size large` → 2048x2048
+- default → 512x512
+
+### 6a.
Print the ranked target list
+
+```
+### Phase 2 Targets (HIGH severity, unfixed)
+| # | Module        | HIGH Count | OOM Verdict | Bottleneck   |
+|---|---------------|------------|-------------|--------------|
+| 1 | geotiff       | 3          | WILL OOM    | IO-bound     |
+| 2 | cost_distance | 1          | WILL OOM    | memory-bound |
+| 3 | viewshed      | 2          | RISKY       | memory-bound |
+```
+
+If no modules qualify, print:
+"No HIGH-severity findings to fix. Run `/sweep-performance` without
+`--skip-phase1` to refresh the triage."
+Then stop.
+
+### 6b. Print the ralph-loop command
+
+Using the target list, generate and print:
+
+````
+/ralph-loop "Performance sweep Phase 2: benchmark and fix HIGH-severity findings.
+
+**Target modules in priority order:**
+1. <module> (<N> HIGH findings, <OOM verdict>) -- <one-line finding summary>
+2. ...
+...
+
+**For each module, in order:**
+
+1. Write a benchmark script at /tmp/perf_sweep_bench_<module>.py that:
+   - Imports the module's public functions
+   - Creates a test array (<size>x<size>, float64)
+   - For EACH available backend (numpy, dask+numpy; cupy and dask+cupy only if available):
+     a. Wrap the array in the appropriate DataArray type
+     b. Measure wall time: timeit.repeat(number=1, repeat=3), take median
+     c. Measure Python memory: tracemalloc.start() / tracemalloc.get_traced_memory()[1] for peak
+     d. Measure process memory: resource.getrusage(RUSAGE_SELF).ru_maxrss before and after
+     e. For CuPy backends: cupy.get_default_memory_pool().used_bytes() before and after
+   - Print results as JSON to stdout
+
+2. Run the benchmark script and capture results.
+
+3. Confirm the HIGH finding from Phase 1:
+   - If the dask backend uses significantly more memory than expected for
+     the chunk size, or wall time shows a materialization stall: CONFIRMED.
+   - If the benchmark shows no anomaly: downgrade to MEDIUM in state file,
+     print 'False positive — skipping' and move to the next module.
+
+4. If confirmed: run /rockout to fix the issue end-to-end (issue, worktree,
+   implementation, tests, docs).
Include the benchmark numbers in the
+   issue body for context.
+
+5. After rockout completes: rerun the same benchmark script. Print a
+   before/after comparison:
+   | Backend    | Metric      | Before | After  | Ratio | Verdict    |
+   |------------|-------------|--------|--------|-------|------------|
+   | numpy      | wall_ms     | 45.2   | 12.1   | 0.27x | IMPROVED   |
+   | dask+numpy | peak_rss_mb | 892    | 34     | 0.04x | IMPROVED   |
+   Thresholds: IMPROVED < 0.8x, REGRESSION > 1.2x, else UNCHANGED.
+
+6. Update .claude/performance-sweep-state.json with the issue number.
+
+7. Output ITERATION DONE
+
+If all targets have been addressed or confirmed as false positives:
+ALL PERFORMANCE ISSUES FIXED." --max-iterations <N> --completion-promise "ALL PERFORMANCE ISSUES FIXED"
+````
+
+Set `--max-iterations` to the number of target modules plus 2 (buffer for
+retries).
+
+### 6c. Print reminder text
+
+```
+Phase 1 triage complete. To proceed with fixes:
+  Copy the ralph-loop command above and paste it.
+
+Other options:
+  Fix one manually: copy any /rockout command from the report above
+  Rerun triage only: /sweep-performance --report-only
+  Skip Phase 1: /sweep-performance --skip-phase1 (reuses last triage)
+  Reset all tracking: /sweep-performance --reset-state
+```
+````
+
+- [ ] **Step 2: Verify the full command file structure**
+
+Run: `grep "^## Step" .claude/commands/sweep-performance.md`
+Expected output:
+```
+## Step 0 -- Determine mode and parse arguments
+## Step 1 -- Discover modules in scope
+## Step 2 -- Gather metadata and score each module
+## Step 3 -- Dispatch parallel subagents for static triage
+## Step 4 -- Merge results and print the triage report
+## Step 5 -- Update state file
+## Step 6 -- Generate the ralph-loop command
+```
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add .claude/commands/sweep-performance.md
+git commit -m "Add Phase 2 ralph-loop generation to sweep-performance"
+```
+
+---
+
+### Task 6: General Rules and Final Polish
+
+**Files:**
+- Modify:
`.claude/commands/sweep-performance.md` + +- [ ] **Step 1: Append the General Rules section** + +Append the following to `.claude/commands/sweep-performance.md`: + +```markdown +--- + +## General Rules + +- Phase 1 subagents do NOT modify any source, test, or benchmark files. + Read-only analysis only. +- Phase 2 ralph-loop modifies code only through `/rockout`. +- Temporary benchmark scripts and graph simulation scripts go in `/tmp/` + with unique names including the module name (e.g. `/tmp/perf_sweep_bench_slope.py`, + `/tmp/perf_sweep_graph_slope.py`). Clean them up after capturing results. +- Only flag patterns that are ACTUALLY present in the code. Do not report + hypothetical issues or patterns that "could" occur. +- Include the exact file path and line number for every finding so the user + can navigate directly to the issue. +- False positives are worse than missed issues. If you are not confident a + pattern is actually harmful in context (e.g. `.values` used intentionally + on a known-numpy array), do not flag it. +- The 30TB simulation constructs the dask task graph only; it NEVER calls + `.compute()`. +- State file (`.claude/performance-sweep-state.json`) is gitignored by + convention — do not add it to git. +- If $ARGUMENTS is empty, use defaults: audit all modules, benchmark at + 512x512, generate ralph-loop for HIGH items. +- For subpackage modules (geotiff, reproject), the subagent should read ALL + `.py` files in the subpackage directory, not just `__init__.py`. +- When generating `/rockout` commands, include the OOM verdict, bottleneck + classification, and affected backends in the prompt text so rockout has + full performance context. +``` + +- [ ] **Step 2: Read the full file end-to-end and verify structure** + +Run: `wc -l .claude/commands/sweep-performance.md` +Expected: Roughly 300-400 lines. + +Run: `grep "^## " .claude/commands/sweep-performance.md` +Expected: Step 0 through Step 6, plus "General Rules". 
+ +- [ ] **Step 3: Commit** + +```bash +git add .claude/commands/sweep-performance.md +git commit -m "Add general rules and finalize sweep-performance command" +``` + +--- + +### Task 7: Smoke Test the Command + +**Files:** +- Read: `.claude/commands/sweep-performance.md` + +- [ ] **Step 1: Verify the command appears in the slash command list** + +Run: `ls .claude/commands/sweep-performance.md` +Expected: File exists. + +- [ ] **Step 2: Verify all cross-references are consistent** + +Check that: +- The state file path `.claude/performance-sweep-state.json` is spelled + the same everywhere in the file. +- The subagent JSON output schema field names match the fields referenced + in Step 4 (report merging). +- The `/rockout` and `/ralph-loop` command syntax matches the patterns used + in the existing `accuracy-sweep.md` and `rockout.md` commands. + +Run: `grep -c "performance-sweep-state.json" .claude/commands/sweep-performance.md` +Expected: At least 4 occurrences (Steps 2, 5, 6, and General Rules). + +Run: `grep -c "/rockout" .claude/commands/sweep-performance.md` +Expected: At least 3 occurrences (Steps 4d, 6b, General Rules). + +Run: `grep -c "/ralph-loop" .claude/commands/sweep-performance.md` +Expected: At least 2 occurrences (Step 6b, Step 6c). + +- [ ] **Step 3: Verify subagent output schema fields match report consumption** + +These field names must appear in both the subagent prompt (Step 3) and the +report merging logic (Step 4): +- `module` +- `findings` (with `severity`, `category`, `file`, `line`, `description`, `fix`, `backends_affected`) +- `oom_verdict` (with `dask_numpy`, `dask_cupy`, `reasoning`) +- `bottleneck` + +Run: `grep -c '"oom_verdict"' .claude/commands/sweep-performance.md` +Expected: At least 2 (schema definition + state file). + +Run: `grep -c '"bottleneck"' .claude/commands/sweep-performance.md` +Expected: At least 2. 
+ +- [ ] **Step 4: Final commit with all checks passing** + +```bash +git add .claude/commands/sweep-performance.md +git commit -m "Verify sweep-performance command integrity" +``` + +Only commit if there were fixes needed. If all checks passed with no +changes, skip this commit. diff --git a/docs/superpowers/specs/2026-03-31-sweep-performance-design.md b/docs/superpowers/specs/2026-03-31-sweep-performance-design.md new file mode 100644 index 00000000..d544e0af --- /dev/null +++ b/docs/superpowers/specs/2026-03-31-sweep-performance-design.md @@ -0,0 +1,368 @@ +# Sweep-Performance: Parallel Performance Triage and Fix Workflow + +**Date:** 2026-03-31 +**Status:** Draft + +## Overview + +A `/sweep-performance` slash command that audits every xrspatial module for +performance bottlenecks, OOM risk under large-scale dask workloads, and +backend-specific anti-patterns. Uses parallel subagents for fast static triage, +then a sequential ralph-loop to benchmark and fix confirmed HIGH-severity +issues. + +The central question for every dask backend: "If the data on disk was 30TB +and the machine only had 16GB of RAM, would this tool cause an out-of-memory +error?" + +## Scope + +All `.py` modules under `xrspatial/` plus the `geotiff/` and `reproject/` +subpackages. Excludes `__init__.py`, `_version.py`, `__main__.py`, `utils.py`, +`accessor.py`, `preview.py`, `dataset_support.py`, `diagnostics.py`, +`analytics.py`. 
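The scope rule above can be sketched as follows; `discover_modules` and its return shape are illustrative names, not part of the command itself:

```python
from pathlib import Path

# Modules excluded from the audit, per the Scope section above.
EXCLUDED = {
    "__init__.py", "_version.py", "__main__.py", "utils.py",
    "accessor.py", "preview.py", "dataset_support.py",
    "diagnostics.py", "analytics.py",
}
SUBPACKAGES = ("geotiff", "reproject")


def discover_modules(pkg_root):
    """Map each audit unit name to the list of files it covers."""
    root = Path(pkg_root)
    units = {}
    # Single-file modules: every top-level .py not in the exclusion list.
    for py in sorted(root.glob("*.py")):
        if py.name not in EXCLUDED:
            units[py.stem] = [py]
    # Subpackage modules: one audit unit per directory, all .py files
    # inside it except __init__.py.
    for sub in SUBPACKAGES:
        subdir = root / sub
        if subdir.is_dir():
            units[sub] = sorted(
                p for p in subdir.rglob("*.py") if p.name != "__init__.py"
            )
    return units
```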
+
+## Architecture
+
+Two phases in a single invocation:
+
+```
+/sweep-performance
+  |
+  +-- Phase 1: Parallel Static Triage
+  |     |-- Score & rank modules (git metadata + complexity heuristics)
+  |     |-- Dispatch one subagent per module
+  |     |     |-- Static analysis (dask, GPU, memory, Numba patterns)
+  |     |     |-- 30TB/16GB OOM simulation (task graph construction, no compute)
+  |     |     +-- Return structured JSON findings
+  |     |-- Merge results into ranked report
+  |     +-- Update state file
+  |
+  +-- Phase 2: Ralph-Loop (HIGH severity only)
+        |-- Generate /ralph-loop command targeting HIGH modules
+        |-- Each iteration:
+        |     |-- Real benchmarks (wall time, tracemalloc, RSS, CuPy pool)
+        |     |-- Confirm finding is not false positive
+        |     |-- /rockout to fix
+        |     |-- Post-fix benchmark comparison
+        |     +-- Update state file
+        +-- User pastes command to start
+```
+
+---
+
+## Phase 1: Module Scoring
+
+For every module in scope, collect via git:
+
+| Field                | Source                                            |
+|----------------------|---------------------------------------------------|
+| `last_modified`      | `git log -1 --format=%aI -- <path>`               |
+| `total_commits`      | `git log --oneline -- <path> \| wc -l`            |
+| `loc`                | `wc -l < <file>`                                  |
+| `has_dask_backend`   | grep for `_run_dask`, `map_overlap`, `map_blocks` |
+| `has_cuda_backend`   | grep for `@cuda.jit`, `import cupy`               |
+| `is_io_module`       | module is in geotiff/ or reproject/               |
+| `has_existing_bench` | matching file exists in `benchmarks/benchmarks/`  |
+
+### Scoring Formula
+
+```
+days_since_inspected = (today - last_perf_inspected).days   # 9999 if never
+days_since_modified = (today - last_modified).days
+
+score = (days_since_inspected * 3)
+      + (loc * 0.1)
+      + (total_commits * 0.5)
+      + (has_dask_backend * 200)
+      + (has_cuda_backend * 150)
+      + (is_io_module * 300)
+      - (days_since_modified * 0.2)
+      - (has_existing_bench * 100)
+
+Rationale:
+- Never-inspected modules dominate (9999 * 3 = ~30,000).
+- Dask and CUDA backends boosted: that is where OOM and perf bugs live.
+- I/O modules get the highest boost: most relevant for 30TB question. +- Larger modules more likely to contain issues. +- Existing ASV benchmarks slightly deprioritize (perf already considered). + +--- + +## Phase 1: Subagent Static Analysis + +One subagent per module. Each performs the checks below and returns a +structured JSON blob. + +### Dask Path Analysis + +- `.values` on dask-backed DataArray (premature materialization) — **HIGH** +- `.compute()` inside a loop — **HIGH** +- `np.array()` / `np.asarray()` wrapping dask or CuPy array — **HIGH** +- `da.stack()` without `.rechunk()` — **MEDIUM** +- `map_overlap` with depth >= chunk_size / 4 — **MEDIUM** +- Missing `boundary` argument in `map_overlap` — **MEDIUM** +- Redundant computation (same function called twice on same input) — **MEDIUM** +- Python loops over dask chunks (serializes the graph) — **MEDIUM** + +### 30TB / 16GB OOM Verdict + +Two-part analysis for each dask code path: + +**Part 1 — Static trace.** Follow the dask code path and answer: does peak +memory scale with total array size, or with chunk size? If any step forces +full materialization, verdict is WILL OOM. + +**Part 2 — Task graph simulation.** Write and execute a script that: + +```python +import dask.array as da +import xarray as xr + +# Use a representative grid (2560x2560, 10x10 = 100 chunks) to inspect +# graph structure. The pattern is identical at any scale — what matters +# is whether the graph fans out, materializes, or stays chunk-local. 
+arr = da.zeros((2560, 2560), chunks=(256, 256), dtype='float64') +raster = xr.DataArray(arr, dims=['y', 'x']) + +# Call the function lazily +result = module_function(raster, **default_args) + +# Inspect the graph without executing +graph = result.__dask_graph__() +task_count = len(graph) +tasks_per_chunk = task_count / 100 # normalize to per-chunk + +# Check for fan-out patterns or full-materialization nodes +# Extrapolate to 30TB: ~57 million chunks at 256x256 float64 +# If tasks_per_chunk is constant => graph scales linearly => SAFE +# If any node depends on all chunks => full materialization => WILL OOM +``` + +The script constructs the graph only, never calls `.compute()`. Reports: +- Task count and tasks-per-chunk ratio +- Estimated peak memory per chunk (MB) +- Whether the graph contains fan-out or materialization nodes +- Extrapolation to 30TB: linear graph growth (SAFE) vs fan-out (WILL OOM) + +**Verdict**: `SAFE`, `RISKY` (bounded but tight), or `WILL OOM` (unbounded +or materializes). 
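A self-contained sketch of the simulation, using dask's `map_overlap` on an identity kernel as a stand-in for a real xrspatial tool (the `tasks_per_chunk` helper and the factor-of-constancy check are illustrative assumptions, not the exact script the subagent writes):

```python
import dask.array as da


def identity_filter(block):
    # Stand-in chunk-local kernel: the graph structure, not the math,
    # is what the simulation inspects.
    return block


def tasks_per_chunk(grid_size, chunk=256):
    arr = da.zeros((grid_size, grid_size), chunks=(chunk, chunk), dtype="float64")
    result = da.map_overlap(identity_filter, arr, depth=1, boundary="reflect")
    graph = result.__dask_graph__()  # build only; never call .compute()
    return len(graph) / arr.npartitions


# A chunk-local operation keeps this ratio roughly constant as the grid
# (and thus the chunk count) grows: linear graph growth => SAFE verdict.
small = tasks_per_chunk(1024)  # 4x4 = 16 chunks
large = tasks_per_chunk(2048)  # 8x8 = 64 chunks
```

If the ratio instead grows with the chunk count, some node depends on many or all chunks (a fan-out or materialization pattern), which is the WILL OOM signature.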
+ +### GPU Transfer Analysis + +- `.data.get()` followed by CuPy ops (GPU-CPU-GPU round-trip) — **HIGH** +- `cupy.asarray()` inside a hot loop — **HIGH** +- Mixing NumPy/CuPy ops without reason — **MEDIUM** +- Register pressure: >20 float64 locals in `@cuda.jit` kernel — **MEDIUM** +- Thread blocks >16x16 on register-heavy kernels — **MEDIUM** + +### Memory Allocation Patterns + +- Unnecessary `.copy()` on arrays never mutated — **MEDIUM** +- `np.zeros_like()` + fill loop (could be `np.empty()`) — **LOW** +- Large temporary arrays that could be fused into the kernel — **MEDIUM** + +### Numba Anti-Patterns + +- Missing `@ngjit` on nested for-loops over `.data` arrays — **MEDIUM** +- `@jit` without `nopython=True` (object-mode fallback risk) — **MEDIUM** +- Type instability (int/float mixing in Numba functions) — **LOW** +- Column-major iteration on row-major arrays (cache-unfriendly) — **LOW** + +### Bottleneck Classification + +Based on static analysis, classify the module as one of: +- **IO-bound** — dominated by disk reads/writes or serialization +- **Memory-bound** — peak allocation is the limiting factor +- **Compute-bound** — CPU/GPU time dominates, memory is fine +- **Graph-bound** — dask task graph overhead dominates (too many small tasks) + +### Subagent Output Schema + +```json +{ + "module": "slope", + "files_read": ["xrspatial/slope.py"], + "findings": [ + { + "severity": "HIGH", + "category": "dask_materialization", + "file": "slope.py", + "line": 142, + "description": ".values on dask input in _run_dask", + "fix": "Use .data.compute() or restructure to stay lazy", + "backends_affected": ["dask+numpy", "dask+cupy"] + } + ], + "oom_verdict": { + "dask_numpy": "SAFE", + "dask_cupy": "SAFE", + "reasoning": "map_overlap with depth=1, memory bounded by chunk size", + "estimated_peak_per_chunk_mb": 0.5, + "task_count": 3721, + "graph_simulation_ran": true + }, + "bottleneck": "compute-bound", + "bottleneck_reasoning": "3x3 kernel with Numba JIT, no I/O, small 
overlap" +} +``` + +--- + +## Phase 1: Merged Report + +After all subagents return, print a consolidated report. + +### Module Risk Ranking Table + +``` +| Rank | Module | Score | OOM Verdict | Bottleneck | HIGH | MED | LOW | +|------|---------------|-------|-----------------|--------------|------|-----|-----| +| 1 | geotiff | 31200 | WILL OOM (d+np) | IO-bound | 3 | 1 | 0 | +| 2 | viewshed | 30050 | RISKY (d+np) | memory-bound | 2 | 2 | 1 | +| ... | ... | ... | ... | ... | ... | ... | ... | +``` + +### 30TB / 16GB Verdict Summary + +Grouped by verdict: + +- **WILL OOM (fix required):** list modules with reasoning +- **RISKY (bounded but tight):** list modules with reasoning +- **SAFE (memory bounded by chunk size):** list modules + +### Detailed Findings + +Per-module table of all findings grouped by severity (file:line, pattern, +description, fix). + +### Actionable Rockout Commands + +For each HIGH-severity finding, a ready-to-paste `/rockout` command. + +### State File Update + +Write `.claude/performance-sweep-state.json`: + +```json +{ + "last_triage": "2026-03-31T14:00:00Z", + "modules": { + "slope": { + "last_inspected": "2026-03-31T14:00:00Z", + "oom_verdict": "SAFE", + "bottleneck": "compute-bound", + "high_count": 0, + "issue": null + } + } +} +``` + +--- + +## Phase 2: Ralph-Loop for HIGH Severity Fixes + +Collect all modules with at least one HIGH-severity finding. Generate a +`/ralph-loop` command targeting them in priority order. + +### Each Iteration + +1. **Benchmark** the module on a moderate array (512x512 default) across all + available backends. Measure four metrics per backend per function: + - Wall time: `timeit.repeat(number=1, repeat=3)`, median + - Python memory: `tracemalloc.get_traced_memory()` peak + - Process memory: `resource.getrusage(RUSAGE_SELF).ru_maxrss` delta + - GPU memory (if CuPy): `cupy.get_default_memory_pool().used_bytes()` delta + +2. **Confirm the static finding** from Phase 1 is real. 
If the benchmark + shows the issue does not manifest (false positive), downgrade to MEDIUM + in the report and skip to next module. + +3. **Classify the bottleneck** with measured data: + - IO-bound: wall time dominated by read/write, low CPU + - Memory-bound: peak RSS much larger than expected for chunk size + - Compute-bound: CPU pegged, memory stable + - Graph-bound: dask task count extremely high, scheduler overhead visible + +4. **Run `/rockout`** to fix the confirmed issue (GitHub issue, worktree, + implementation, tests, docs). + +5. **Post-fix benchmark** — rerun the same benchmark. Report before/after + delta. + +6. **Update state** — record the fix in + `.claude/performance-sweep-state.json` with issue number. + +7. Output `ITERATION DONE`. + +### Generated Command Shape + +``` +/ralph-loop "Performance sweep Phase 2: benchmark and fix HIGH-severity findings. + +**Target modules in priority order:** +1. geotiff (3 HIGH findings, WILL OOM) -- eager .values materialization +2. cost_distance (1 HIGH finding, WILL OOM) -- iterative solver unbounded memory + +**For each module:** +1. Write and run a benchmark script measuring wall time, peak memory + (tracemalloc + RSS + CuPy pool) across all available backends +2. Confirm the HIGH finding from Phase 1 triage is real +3. If confirmed: run /rockout to fix it end-to-end +4. After rockout: rerun benchmark, report before/after delta +5. Update .claude/performance-sweep-state.json +6. Output ITERATION DONE + +If all targets addressed: ALL PERFORMANCE ISSUES FIXED." +--max-iterations {N+2} --completion-promise "ALL PERFORMANCE ISSUES FIXED" +``` + +### Reminder Text + +``` +Phase 1 triage complete. To proceed with fixes: + Copy the ralph-loop command above and paste it. 
+ +Other options: + Fix one manually: copy any /rockout command from the report above + Rerun triage only: /sweep-performance --report-only + Skip Phase 1: /sweep-performance --skip-phase1 (reuses last triage) + Reset all tracking: /sweep-performance --reset-state +``` + +--- + +## Arguments + +| Argument | Effect | +|--------------------|------------------------------------------------------------| +| `--top N` | Limit Phase 1 subagents to top N scored modules (default: all) | +| `--exclude m1,m2` | Remove named modules from scope | +| `--only-terrain` | slope, aspect, curvature, terrain, terrain_metrics, hillshade, sky_view_factor | +| `--only-focal` | focal, convolution, morphology, bilateral, edge_detection, glcm | +| `--only-hydro` | flood, cost_distance, geodesic, surface_distance, viewshed, erosion, diffusion | +| `--only-io` | geotiff, reproject, rasterize, polygonize | +| `--reset-state` | Delete state file and start fresh | +| `--skip-phase1` | Reuse last triage state, go straight to ralph-loop generation | +| `--report-only` | Run Phase 1 only, no ralph-loop command | +| `--size small` | Benchmark at 128x128 | +| `--size large` | Benchmark at 2048x2048 | +| `--high-only` | Only report HIGH severity findings | + +Default (no arguments): audit all modules, benchmark at 512x512, generate +ralph-loop for HIGH items. + +--- + +## General Rules + +- Phase 1 subagents do NOT modify source files. Read-only analysis. +- Phase 2 ralph-loop modifies code only through `/rockout`. +- Temporary benchmark scripts go in `/tmp/` with unique names. +- Only flag patterns actually present in the code; no hypothetical issues. +- Include exact file path and line number for every finding. +- False positives are worse than missed issues. +- The 30TB simulation constructs the dask graph only; it never calls `.compute()`. +- State file (`.claude/performance-sweep-state.json`) is gitignored by convention. 
diff --git a/xrspatial/surface_distance.py b/xrspatial/surface_distance.py index 5ff8d03b..572181cc 100644 --- a/xrspatial/surface_distance.py +++ b/xrspatial/surface_distance.py @@ -305,6 +305,20 @@ def _precompute_dd_grid(lat_2d, lon_2d, dy, dx): """ H, W = lat_2d.shape n = len(dy) + # Memory guard: dd_grid is (n_neighbors, H, W) float64 + estimated = n * H * W * 8 + try: + from xrspatial.zonal import _available_memory_bytes + avail = _available_memory_bytes() + except ImportError: + avail = 2 * 1024**3 + if estimated > 0.8 * avail: + raise MemoryError( + f"Geodesic dd_grid needs ~{estimated / 1e9:.1f} GB " + f"({n} neighbors x {H}x{W} x 8 bytes) but only " + f"~{avail / 1e9:.1f} GB available. Use planar mode " + f"or downsample the raster." + ) dd_grid = np.zeros((n, H, W), dtype=np.float64) for i in range(n):