diff --git a/.claude/commands/sweep-performance.md b/.claude/commands/sweep-performance.md new file mode 100644 index 00000000..2079f191 --- /dev/null +++ b/.claude/commands/sweep-performance.md @@ -0,0 +1,494 @@ +# Performance Sweep: Parallel Triage and Fix Workflow + +Audit xrspatial modules for performance bottlenecks, OOM risk under 30TB dask +workloads, and backend-specific anti-patterns. Dispatches parallel subagents +for fast triage, then generates a ralph-loop to benchmark and fix HIGH-severity +issues. + +Optional arguments: $ARGUMENTS +(e.g. `--top 5`, `--exclude slope,aspect`, `--only-io`, `--reset-state`) + +--- + +## Step 0 -- Determine mode and parse arguments + +Parse $ARGUMENTS for these flags (multiple may combine): + +| Flag | Effect | +|------|--------| +| `--top N` | Limit Phase 1 to the top N scored modules (default: all) | +| `--exclude mod1,mod2` | Remove named modules from scope | +| `--only-terrain` | Restrict to: slope, aspect, curvature, terrain, terrain_metrics, hillshade, sky_view_factor | +| `--only-focal` | Restrict to: focal, convolution, morphology, bilateral, edge_detection, glcm | +| `--only-hydro` | Restrict to: flood, cost_distance, geodesic, surface_distance, viewshed, erosion, diffusion | +| `--only-io` | Restrict to: geotiff, reproject, rasterize, polygonize | +| `--reset-state` | Delete `.claude/performance-sweep-state.json` and treat all modules as never-inspected | +| `--skip-phase1` | Skip triage; reuse last state file; go straight to ralph-loop generation for unresolved HIGH items | +| `--report-only` | Run Phase 1 triage but do not generate a ralph-loop command | +| `--size small` | Phase 2 benchmarks use 128x128 arrays | +| `--size large` | Phase 2 benchmarks use 2048x2048 arrays | +| `--high-only` | Only report HIGH severity findings in the triage output | + +If `--skip-phase1` is set, jump to Step 6 (ralph-loop generation). +Otherwise proceed to Step 1. 
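The flag table above can be sketched as a small parser. This is an illustrative sketch only — the command parses `$ARGUMENTS` in prose at run time, and the function name, option dict, and defaults here are assumptions, not part of the command:

```python
# Hypothetical sketch of the Step 0 flag semantics. Not part of the command
# file itself; names and defaults are illustrative.
import shlex

SIZES = {"small": 128, "large": 2048}  # --size values; default is 512


def parse_sweep_args(arguments: str) -> dict:
    opts = {"top": None, "exclude": set(), "only": None,
            "reset_state": False, "skip_phase1": False,
            "report_only": False, "high_only": False, "size": 512}
    tokens = shlex.split(arguments)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--top":
            i += 1
            opts["top"] = int(tokens[i])
        elif tok == "--exclude":
            i += 1
            opts["exclude"] |= set(tokens[i].split(","))
        elif tok == "--size":
            i += 1
            opts["size"] = SIZES[tokens[i]]
        elif tok.startswith("--only-"):
            opts["only"] = tok[len("--only-"):]  # e.g. "terrain", "io"
        else:
            # boolean flags: "--reset-state" -> opts["reset_state"] = True
            opts[tok.lstrip("-").replace("-", "_")] = True
        i += 1
    return opts
```

Note that multiple flags combine, e.g. `--only-io --top 2 --high-only` narrows scope, caps the ranking, and filters the report in one invocation.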
+
+## Step 1 -- Discover modules in scope
+
+Enumerate all candidate modules. For each, record its file path(s):
+
+**Single-file modules:** Every `.py` file directly under `xrspatial/`, excluding
+`__init__.py`, `_version.py`, `__main__.py`, `utils.py`, `accessor.py`,
+`preview.py`, `dataset_support.py`, `diagnostics.py`, `analytics.py`.
+
+**Subpackage modules:** The `geotiff/` and `reproject/` directories under
+`xrspatial/`. Treat each subpackage as a single audit unit. List all `.py`
+files within each (excluding `__init__.py`).
+
+Apply `--only-*` and `--exclude` filters from Step 0 to narrow the list.
+
+Store the filtered module list in memory (do NOT write intermediate files).
+
+## Step 2 -- Gather metadata and score each module
+
+For every module in scope, collect:
+
+| Field | How |
+|-------|-----|
+| **last_modified** | `git log -1 --format=%aI -- <path>` (for subpackages, use the most recent file) |
+| **total_commits** | `git log --oneline -- <path> \| wc -l` |
+| **loc** | `wc -l < <path>` (for subpackages, sum all files) |
+| **has_dask_backend** | grep the file(s) for `_run_dask`, `map_overlap`, `map_blocks` |
+| **has_cuda_backend** | grep the file(s) for `@cuda.jit`, `import cupy` |
+| **is_io_module** | module is geotiff or reproject |
+| **has_existing_bench** | a file matching the module name exists in `benchmarks/benchmarks/` |
+
+### Load inspection state
+
+Read `.claude/performance-sweep-state.json`. If it does not exist, treat every
+module as never-inspected. If `--reset-state` was set, delete the file first. 
+
+State file schema:
+
+```json
+{
+  "last_triage": "ISO-DATE",
+  "modules": {
+    "slope": {
+      "last_inspected": "ISO-DATE",
+      "oom_verdict": "SAFE",
+      "bottleneck": "compute-bound",
+      "high_count": 0,
+      "issue": null
+    }
+  }
+}
+```
+
+### Compute scores
+
+```
+days_since_inspected = (today - last_inspected).days  # 9999 if never
+days_since_modified = (today - last_modified).days
+
+score = (days_since_inspected * 3)
+      + (loc * 0.1)
+      + (total_commits * 0.5)
+      + (has_dask_backend * 200)
+      + (has_cuda_backend * 150)
+      + (is_io_module * 300)
+      - (days_since_modified * 0.2)
+      - (has_existing_bench * 100)
+```
+
+Sort modules by score descending. If `--top N` is set, keep only the top N.
+
+## Step 3 -- Dispatch parallel subagents for static triage
+
+For each module in the scored list, dispatch a subagent using the Agent tool.
+Launch ALL subagents in a single message (parallel dispatch). Each subagent
+receives the prompt below, with `MODULE_NAME` and `MODULE_FILES` substituted.
+
+**Subagent prompt template:**
+
+~~~
+You are auditing the xrspatial module "MODULE_NAME" for performance issues.
+
+Read these files: MODULE_FILES
+
+Perform ALL of the following analyses and return your findings as a single
+JSON object. Do NOT modify any files. This is read-only analysis.
+
+### 1. Dask Path Analysis
+
+Trace every dask code path (_run_dask, _run_dask_cupy, or any function that
+receives dask-backed DataArrays). 
Flag these patterns with severity: + +- HIGH: `.values` on a dask-backed DataArray or CuPy array (premature materialization) +- HIGH: `.compute()` inside a loop (materializes full graph each iteration) +- HIGH: `np.array()` or `np.asarray()` wrapping a dask or CuPy array +- MEDIUM: `da.stack()` without a following `.rechunk()` +- MEDIUM: `map_overlap` with depth >= chunk_size / 4 +- MEDIUM: Missing `boundary` argument in `map_overlap` +- MEDIUM: Same function called twice on same input without caching +- MEDIUM: Python `for` loop iterating over dask chunks (serializes the graph) + +If the module has NO dask code path, note "no dask backend" and skip. + +### 2. 30TB / 16GB OOM Verdict + +For each dask code path found in section 1: + +**Part A — Static trace:** Follow the code end-to-end. Answer: does peak +memory scale with total array size, or with chunk size? If any operation +forces full materialization, the verdict is WILL OOM. + +**Part B — Task graph simulation:** Write and run a Python script (in /tmp/ +with a unique name including "MODULE_NAME") that: + +```python +import dask.array as da +import xarray as xr +import json, sys + +arr = da.zeros((2560, 2560), chunks=(256, 256), dtype='float64') +raster = xr.DataArray(arr, dims=['y', 'x']) + +# Add coords if the function needs them (geodesic, slope with CRS, etc.) 
+# raster = raster.assign_coords(x=np.linspace(-180, 180, 2560),
+#                               y=np.linspace(-90, 90, 2560))
+
+try:
+    result = MODULE_FUNCTION(raster, **DEFAULT_ARGS)
+    graph = result.__dask_graph__()
+    task_count = len(graph)
+    tasks_per_chunk = task_count / 100.0  # (2560/256)**2 = 100 chunks
+
+    # Fan-in check: the maximum number of upstream tasks any single task
+    # depends on. High fan-in means the scheduler must hold many chunks
+    # in memory at once to produce one output.
+    from dask.core import get_dependencies
+    dsk = dict(graph)
+    max_fan_in = max(
+        (len(get_dependencies(dsk, key)) for key in dsk), default=0)
+
+    print(json.dumps({
+        "success": True,
+        "task_count": task_count,
+        "tasks_per_chunk": round(tasks_per_chunk, 2),
+        "max_fan_in": max_fan_in,
+        "extrapolation_30tb": "~{} tasks at 57M chunks".format(
+            int(tasks_per_chunk * 57_000_000))
+    }))
+except Exception as e:
+    print(json.dumps({"success": False, "error": str(e)}))
+```
+
+Adapt the function call and imports for the specific module. Run the script
+and capture its JSON output. If it errors, record the error and rely on
+Part A alone.
+
+**Verdict:** One of:
+- `SAFE` — memory bounded by chunk size, graph scales linearly
+- `RISKY` — bounded but tight (e.g. large overlap depth, 3D intermediates)
+- `WILL OOM` — forces full materialization or unbounded memory growth
+
+### 3. GPU Transfer Analysis
+
+Scan for CuPy/CUDA code paths. Flag:
+
+- HIGH: `.data.get()` followed by CuPy operations (GPU-CPU-GPU round-trip)
+- HIGH: `cupy.asarray()` inside a loop (repeated CPU-GPU transfers)
+- MEDIUM: Mixing NumPy and CuPy ops in same function without clear reason
+- MEDIUM: Register pressure — count float64 local variables in `@cuda.jit`
+  kernels; flag if >20
+- MEDIUM: Thread blocks >16x16 on kernels with >20 float64 locals
+
+If the module has NO GPU code path, note "no GPU backend" and skip.
+
+### 4. 
Memory Allocation Patterns + +- MEDIUM: Unnecessary `.copy()` on arrays never mutated downstream +- MEDIUM: Large temporary arrays that could be fused into the kernel +- LOW: `np.zeros_like()` + fill loop where `np.empty()` would suffice + +### 5. Numba Anti-Patterns + +- MEDIUM: Missing `@ngjit` on nested for-loops over `.data` arrays +- MEDIUM: `@jit` without `nopython=True` (object-mode fallback risk) +- LOW: Type instability — initializing with int then assigning float +- LOW: Column-major iteration on row-major arrays (inner loop should be last axis) + +### 6. Bottleneck Classification + +Based on your analysis, classify the module as ONE of: +- `IO-bound` — dominated by disk reads/writes or serialization +- `memory-bound` — peak allocation is the limiting factor +- `compute-bound` — CPU/GPU time dominates, memory is fine +- `graph-bound` — dask task graph overhead dominates + +### Output Format + +Return EXACTLY this JSON structure (no extra text before or after): + +```json +{ + "module": "MODULE_NAME", + "files_read": ["list of files you read"], + "findings": [ + { + "severity": "HIGH|MEDIUM|LOW", + "category": "dask_materialization|dask_chunking|gpu_transfer|register_pressure|memory_allocation|numba_antipattern", + "file": "filename.py", + "line": 123, + "description": "what the issue is", + "fix": "how to fix it", + "backends_affected": ["dask+numpy", "dask+cupy", "cupy", "numpy"] + } + ], + "oom_verdict": { + "dask_numpy": "SAFE|RISKY|WILL OOM", + "dask_cupy": "SAFE|RISKY|WILL OOM", + "reasoning": "one-sentence explanation", + "estimated_peak_per_chunk_mb": 0.5, + "task_count": 3721, + "tasks_per_chunk": 37.21, + "graph_simulation_ran": true + }, + "bottleneck": "compute-bound|memory-bound|IO-bound|graph-bound", + "bottleneck_reasoning": "one-sentence explanation" +} +``` + +IMPORTANT: Only flag patterns that are ACTUALLY present in the code. Do not +report hypothetical issues. False positives are worse than missed issues. 
+If a pattern like `.values` is used on a known-numpy-only code path, do not +flag it. +~~~ + +Wait for all subagents to return before proceeding to Step 4. + +## Step 4 -- Merge results and print the triage report + +Parse the JSON returned by each subagent. If a subagent returned malformed +output, record the module as "audit failed" with a note. + +### 4a. Print the Module Risk Ranking Table + +Sort modules by score descending. Print: + +``` +## Performance Sweep — Static Triage Report + +### Module Risk Ranking +| Rank | Module | Score | OOM Verdict | Bottleneck | HIGH | MED | LOW | +|------|-----------------|--------|-----------------|---------------|------|-----|-----| +| 1 | geotiff | 31200 | WILL OOM (d+np) | IO-bound | 3 | 1 | 0 | +| 2 | viewshed | 30050 | RISKY (d+np) | memory-bound | 2 | 2 | 1 | +| ... | ... | ... | ... | ... | ... | ... | ... | +``` + +If `--high-only` is set, only count HIGH findings and omit modules with zero HIGH. + +### 4b. Print the 30TB / 16GB Verdict Summary + +Group modules by OOM verdict: + +``` +### 30TB on Disk / 16GB RAM — Out-of-Memory Analysis + +#### WILL OOM (fix required) +- **module_name**: reasoning from subagent + +#### RISKY (bounded but tight) +- **module_name**: reasoning from subagent + +#### SAFE (memory bounded by chunk size) +- module_name, module_name, module_name, ... +``` + +### 4c. Print Detailed Findings + +For each module that has findings, print a severity-grouped table: + +``` +### module_name (bottleneck: compute-bound, OOM: SAFE) + +| # | Severity | File:Line | Category | Description | Fix | +|---|----------|----------------|-------------------------|------------------------------|-------------------------------| +| 1 | HIGH | slope.py:142 | dask_materialization | .values on dask input | Use .data or stay lazy | +| 2 | MEDIUM | slope.py:88 | dask_chunking | map_overlap depth too large | Reduce depth or warn users | +``` + +### 4d. 
Print Actionable Rockout Commands
+
+For each HIGH-severity finding, print a ready-to-paste `/rockout` command:
+
+```
+### Ready-to-Run Fixes (HIGH severity only)
+
+1. **geotiff** — eager .values materialization (WILL OOM)
+   /rockout "Fix eager .values materialization in geotiff reader.
+   The dask read path at reader.py:87 calls .values which forces
+   the full array into memory. For 30TB inputs this will OOM on
+   a 16GB machine. Must stay lazy through the entire read path."
+
+2. **cost_distance** — iterative solver unbounded memory (WILL OOM)
+   /rockout "Fix cost_distance iterative solver to work within
+   bounded memory. Currently materializes the full distance matrix
+   each iteration. Must use chunked iteration for 30TB dask inputs."
+```
+
+Construct each `/rockout` command from the finding's description and fix fields.
+Include the OOM verdict and bottleneck classification in the prompt text so
+rockout has full context.
+
+## Step 5 -- Update state file
+
+Write `.claude/performance-sweep-state.json` with the triage results:
+
+```json
+{
+  "last_triage": "<ISO-DATE>",
+  "modules": {
+    "<module>": {
+      "last_inspected": "<ISO-DATE>",
+      "oom_verdict": "<SAFE|RISKY|WILL OOM>",
+      "bottleneck": "<classification>",
+      "high_count": <count>,
+      "issue": null
+    }
+  }
+}
+```
+
+If the file already exists, merge — update entries for modules that were
+just audited, keep entries for modules not in this run's scope.
+
+If `--report-only` is set, stop here. Do not proceed to Step 6.
+
+## Step 6 -- Generate the ralph-loop command
+
+Collect all modules from Step 4 (or from the state file if `--skip-phase1`)
+that have at least one HIGH-severity finding and no `issue` recorded in the
+state file (i.e. not yet fixed).
+
+Sort them by: WILL OOM first, then RISKY, then by HIGH count descending.
+
+Determine the benchmark array size from arguments:
+- `--size small` → 128x128
+- `--size large` → 2048x2048
+- default → 512x512
+
+### 6a. 
Print the ranked target list
+
+```
+### Phase 2 Targets (HIGH severity, unfixed)
+| # | Module        | HIGH Count | OOM Verdict | Bottleneck   |
+|---|---------------|------------|-------------|--------------|
+| 1 | geotiff       | 3          | WILL OOM    | IO-bound     |
+| 2 | cost_distance | 1          | WILL OOM    | memory-bound |
+| 3 | viewshed      | 2          | RISKY       | memory-bound |
+```
+
+If no modules qualify, print:
+"No HIGH-severity findings to fix. Run `/sweep-performance` without
+`--skip-phase1` to refresh the triage."
+Then stop.
+
+### 6b. Print the ralph-loop command
+
+Using the target list, generate and print:
+
+````
+/ralph-loop "Performance sweep Phase 2: benchmark and fix HIGH-severity findings.
+
+**Target modules in priority order:**
+1. <module> (<HIGH count> HIGH findings, <OOM verdict>) -- <bottleneck>
+2. ...
+...
+
+**For each module, in order:**
+
+1. Write a benchmark script at /tmp/perf_sweep_bench_<module>.py that:
+   - Imports the module's public functions
+   - Creates a test array (<size>x<size>, float64)
+   - For EACH available backend (numpy, dask+numpy; cupy and dask+cupy only if available):
+     a. Wrap the array in the appropriate DataArray type
+     b. Measure wall time: timeit.repeat(number=1, repeat=3), take median
+     c. Measure Python memory: tracemalloc.start() / tracemalloc.get_traced_memory()[1] for peak
+     d. Measure process memory: resource.getrusage(RUSAGE_SELF).ru_maxrss before and after
+     e. For CuPy backends: cupy.get_default_memory_pool().used_bytes() before and after
+   - Print results as JSON to stdout
+
+2. Run the benchmark script and capture results.
+
+3. Confirm the HIGH finding from Phase 1:
+   - If the dask backend uses significantly more memory than expected for
+     the chunk size, or wall time shows a materialization stall: CONFIRMED.
+   - If the benchmark shows no anomaly: downgrade to MEDIUM in state file,
+     print 'False positive — skipping' and move to the next module.
+
+4. If confirmed: run /rockout to fix the issue end-to-end (issue, worktree,
+   implementation, tests, docs). 
Include the benchmark numbers in the
+   issue body for context.
+
+5. After rockout completes: rerun the same benchmark script. Print a
+   before/after comparison:
+   | Backend    | Metric      | Before | After  | Ratio | Verdict    |
+   |------------|-------------|--------|--------|-------|------------|
+   | numpy      | wall_ms     | 45.2   | 12.1   | 0.27x | IMPROVED   |
+   | dask+numpy | peak_rss_mb | 892    | 34     | 0.04x | IMPROVED   |
+   Thresholds: IMPROVED < 0.8x, REGRESSION > 1.2x, else UNCHANGED.
+
+6. Update .claude/performance-sweep-state.json with the issue number.
+
+7. Output ITERATION DONE
+
+If all targets have been addressed or confirmed as false positives:
+ALL PERFORMANCE ISSUES FIXED." --max-iterations <N> --completion-promise "ALL PERFORMANCE ISSUES FIXED"
+````
+
+Set `--max-iterations` to the number of target modules plus 2 (buffer for
+retries).
+
+### 6c. Print reminder text
+
+```
+Phase 1 triage complete. To proceed with fixes:
+  Copy the ralph-loop command above and paste it.
+
+Other options:
+  Fix one manually: copy any /rockout command from the report above
+  Rerun triage only: /sweep-performance --report-only
+  Skip Phase 1: /sweep-performance --skip-phase1 (reuses last triage)
+  Reset all tracking: /sweep-performance --reset-state
+```
+
+---
+
+## General Rules
+
+- Phase 1 subagents do NOT modify any source, test, or benchmark files.
+  Read-only analysis only.
+- Phase 2 ralph-loop modifies code only through `/rockout`.
+- Temporary benchmark scripts and graph simulation scripts go in `/tmp/`
+  with unique names including the module name (e.g. `/tmp/perf_sweep_bench_slope.py`,
+  `/tmp/perf_sweep_graph_slope.py`). Clean them up after capturing results.
+- Only flag patterns that are ACTUALLY present in the code. Do not report
+  hypothetical issues or patterns that "could" occur.
+- Include the exact file path and line number for every finding so the user
+  can navigate directly to the issue.
+- False positives are worse than missed issues. 
If you are not confident a + pattern is actually harmful in context (e.g. `.values` used intentionally + on a known-numpy array), do not flag it. +- The 30TB simulation constructs the dask task graph only; it NEVER calls + `.compute()`. +- State file (`.claude/performance-sweep-state.json`) is gitignored by + convention — do not add it to git. +- If $ARGUMENTS is empty, use defaults: audit all modules, benchmark at + 512x512, generate ralph-loop for HIGH items. +- For subpackage modules (geotiff, reproject), the subagent should read ALL + `.py` files in the subpackage directory, not just `__init__.py`. +- When generating `/rockout` commands, include the OOM verdict, bottleneck + classification, and affected backends in the prompt text so rockout has + full performance context. diff --git a/docs/superpowers/plans/2026-03-31-sweep-performance.md b/docs/superpowers/plans/2026-03-31-sweep-performance.md new file mode 100644 index 00000000..8615a41e --- /dev/null +++ b/docs/superpowers/plans/2026-03-31-sweep-performance.md @@ -0,0 +1,743 @@ +# Sweep-Performance Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Create a `/sweep-performance` slash command that audits all xrspatial modules for performance bottlenecks, OOM risk under 30TB dask workloads, and backend anti-patterns using parallel subagents, then generates a ralph-loop to fix HIGH-severity issues. + +**Architecture:** Single command file (`.claude/commands/sweep-performance.md`) containing all instructions for both phases. Phase 1 dispatches parallel subagents via the Agent tool for static analysis + 30TB graph simulation. Phase 2 generates a `/ralph-loop` command targeting HIGH-severity modules for real benchmarks and `/rockout` fixes. State persisted in `.claude/performance-sweep-state.json`. 
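The state-merge rule described above (entries for modules audited in this run replace their old entries; modules outside the run's scope keep their previous state) can be sketched as follows. This is an illustrative sketch only — `merge_state` and its signature are not part of the plan; the command performs this merge directly when it rewrites the state file:

```python
# Hypothetical sketch of the state-file merge rule. Audited modules
# overwrite their old entries; everything else is preserved untouched.
import datetime
import json
from pathlib import Path


def merge_state(path: Path, audited: dict) -> dict:
    state = {"last_triage": None, "modules": {}}
    if path.exists():
        state = json.loads(path.read_text())
    state["last_triage"] = datetime.date.today().isoformat()
    # dict.update gives the just-audited modules precedence while keeping
    # entries for modules that were not in this run's scope
    state.setdefault("modules", {}).update(audited)
    path.write_text(json.dumps(state, indent=2))
    return state
```

This is also why `--skip-phase1` works: the unresolved HIGH entries from the last triage survive every subsequent partial run.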
+ +**Tech Stack:** Claude Code slash commands (markdown), Agent tool for subagent dispatch, Bash for git metadata and benchmark scripts, dask for graph simulation, tracemalloc/resource/cupy for memory measurement. + +--- + +## File Structure + +| File | Purpose | +|------|---------| +| Create: `.claude/commands/sweep-performance.md` | The slash command — all Phase 1 and Phase 2 logic | +| Create: `.claude/performance-sweep-state.json` | Runtime state file (created by the command at execution time, not committed) | + +This is a single-file deliverable. The command file contains all the instructions that Claude follows when `/sweep-performance` is invoked. No Python code, no library files — just a well-structured prompt document, same pattern as `accuracy-sweep.md`. + +--- + +### Task 1: Scaffold the Command Header and Argument Parsing + +**Files:** +- Create: `.claude/commands/sweep-performance.md` + +- [ ] **Step 1: Create the command file with title, description, and argument parsing** + +```markdown +# Performance Sweep: Parallel Triage and Fix Workflow + +Audit xrspatial modules for performance bottlenecks, OOM risk under 30TB dask +workloads, and backend-specific anti-patterns. Dispatches parallel subagents +for fast triage, then generates a ralph-loop to benchmark and fix HIGH-severity +issues. + +Optional arguments: $ARGUMENTS +(e.g. 
`--top 5`, `--exclude slope,aspect`, `--only-io`, `--reset-state`) + +--- + +## Step 0 -- Determine mode and parse arguments + +Parse $ARGUMENTS for these flags (multiple may combine): + +| Flag | Effect | +|------|--------| +| `--top N` | Limit Phase 1 to the top N scored modules (default: all) | +| `--exclude mod1,mod2` | Remove named modules from scope | +| `--only-terrain` | Restrict to: slope, aspect, curvature, terrain, terrain_metrics, hillshade, sky_view_factor | +| `--only-focal` | Restrict to: focal, convolution, morphology, bilateral, edge_detection, glcm | +| `--only-hydro` | Restrict to: flood, cost_distance, geodesic, surface_distance, viewshed, erosion, diffusion | +| `--only-io` | Restrict to: geotiff, reproject, rasterize, polygonize | +| `--reset-state` | Delete `.claude/performance-sweep-state.json` and treat all modules as never-inspected | +| `--skip-phase1` | Skip triage; reuse last state file; go straight to ralph-loop generation for unresolved HIGH items | +| `--report-only` | Run Phase 1 triage but do not generate a ralph-loop command | +| `--size small` | Phase 2 benchmarks use 128x128 arrays | +| `--size large` | Phase 2 benchmarks use 2048x2048 arrays | +| `--high-only` | Only report HIGH severity findings in the triage output | + +If `--skip-phase1` is set, jump to Step 6 (ralph-loop generation). +Otherwise proceed to Step 1. +``` + +- [ ] **Step 2: Verify the file was created correctly** + +Run: `head -40 .claude/commands/sweep-performance.md` +Expected: The title, description, and Step 0 argument table are present. 
+ +- [ ] **Step 3: Commit** + +```bash +git add .claude/commands/sweep-performance.md +git commit -m "Add sweep-performance command scaffold with argument parsing" +``` + +--- + +### Task 2: Module Discovery and Scoring (Step 1-2) + +**Files:** +- Modify: `.claude/commands/sweep-performance.md` + +- [ ] **Step 1: Append Step 1 (module discovery) to the command file** + +Append the following to `.claude/commands/sweep-performance.md`: + +```markdown +## Step 1 -- Discover modules in scope + +Enumerate all candidate modules. For each, record its file path(s): + +**Single-file modules:** Every `.py` file directly under `xrspatial/`, excluding +`__init__.py`, `_version.py`, `__main__.py`, `utils.py`, `accessor.py`, +`preview.py`, `dataset_support.py`, `diagnostics.py`, `analytics.py`. + +**Subpackage modules:** The `geotiff/` and `reproject/` directories under +`xrspatial/`. Treat each subpackage as a single audit unit. List all `.py` +files within each (excluding `__init__.py`). + +Apply `--only-*` and `--exclude` filters from Step 0 to narrow the list. + +Store the filtered module list in memory (do NOT write intermediate files). 
+```
+
+- [ ] **Step 2: Append Step 2 (git metadata and scoring) to the command file**
+
+Append the following:
+
+```markdown
+## Step 2 -- Gather metadata and score each module
+
+For every module in scope, collect:
+
+| Field | How |
+|-------|-----|
+| **last_modified** | `git log -1 --format=%aI -- <path>` (for subpackages, use the most recent file) |
+| **total_commits** | `git log --oneline -- <path> \| wc -l` |
+| **loc** | `wc -l < <path>` (for subpackages, sum all files) |
+| **has_dask_backend** | grep the file(s) for `_run_dask`, `map_overlap`, `map_blocks` |
+| **has_cuda_backend** | grep the file(s) for `@cuda.jit`, `import cupy` |
+| **is_io_module** | module is geotiff or reproject |
+| **has_existing_bench** | a file matching the module name exists in `benchmarks/benchmarks/` |
+
+### Load inspection state
+
+Read `.claude/performance-sweep-state.json`. If it does not exist, treat every
+module as never-inspected. If `--reset-state` was set, delete the file first.
+
+State file schema:
+
+~~~json
+{
+  "last_triage": "ISO-DATE",
+  "modules": {
+    "slope": {
+      "last_inspected": "ISO-DATE",
+      "oom_verdict": "SAFE",
+      "bottleneck": "compute-bound",
+      "high_count": 0,
+      "issue": null
+    }
+  }
+}
+~~~
+
+### Compute scores
+
+~~~
+days_since_inspected = (today - last_inspected).days  # 9999 if never
+days_since_modified = (today - last_modified).days
+
+score = (days_since_inspected * 3)
+      + (loc * 0.1)
+      + (total_commits * 0.5)
+      + (has_dask_backend * 200)
+      + (has_cuda_backend * 150)
+      + (is_io_module * 300)
+      - (days_since_modified * 0.2)
+      - (has_existing_bench * 100)
+~~~
+
+Sort modules by score descending. If `--top N` is set, keep only the top N. 
+``` + +- [ ] **Step 3: Verify the appended steps read correctly** + +Run: `grep -c "^## Step" .claude/commands/sweep-performance.md` +Expected: `3` (Step 0, Step 1, Step 2) + +- [ ] **Step 4: Commit** + +```bash +git add .claude/commands/sweep-performance.md +git commit -m "Add module discovery and scoring to sweep-performance" +``` + +--- + +### Task 3: Phase 1 Subagent Dispatch (Step 3) + +**Files:** +- Modify: `.claude/commands/sweep-performance.md` + +- [ ] **Step 1: Append Step 3 (subagent dispatch and analysis instructions)** + +Append the following to `.claude/commands/sweep-performance.md`: + +````markdown +## Step 3 -- Dispatch parallel subagents for static triage + +For each module in the scored list, dispatch a subagent using the Agent tool. +Launch ALL subagents in a single message (parallel dispatch). Each subagent +receives the prompt below, with `MODULE_NAME` and `MODULE_FILES` substituted. + +**Subagent prompt template:** + +``` +You are auditing the xrspatial module "MODULE_NAME" for performance issues. + +Read these files: MODULE_FILES + +Perform ALL of the following analyses and return your findings as a single +JSON object. Do NOT modify any files. This is read-only analysis. + +## 1. Dask Path Analysis + +Trace every dask code path (_run_dask, _run_dask_cupy, or any function that +receives dask-backed DataArrays). 
Flag these patterns with severity: + +- HIGH: `.values` on a dask-backed DataArray or CuPy array (premature materialization) +- HIGH: `.compute()` inside a loop (materializes full graph each iteration) +- HIGH: `np.array()` or `np.asarray()` wrapping a dask or CuPy array +- MEDIUM: `da.stack()` without a following `.rechunk()` +- MEDIUM: `map_overlap` with depth >= chunk_size / 4 +- MEDIUM: Missing `boundary` argument in `map_overlap` +- MEDIUM: Same function called twice on same input without caching +- MEDIUM: Python `for` loop iterating over dask chunks (serializes the graph) + +If the module has NO dask code path, note "no dask backend" and skip. + +## 2. 30TB / 16GB OOM Verdict + +For each dask code path found in section 1: + +**Part A — Static trace:** Follow the code end-to-end. Answer: does peak +memory scale with total array size, or with chunk size? If any operation +forces full materialization, the verdict is WILL OOM. + +**Part B — Task graph simulation:** Write and run a Python script (in /tmp/ +with a unique name including "MODULE_NAME") that: + +```python +import dask.array as da +import xarray as xr +import json, sys + +arr = da.zeros((2560, 2560), chunks=(256, 256), dtype='float64') +raster = xr.DataArray(arr, dims=['y', 'x']) + +# Add coords if the function needs them (geodesic, slope with CRS, etc.) 
+# raster = raster.assign_coords(x=np.linspace(-180, 180, 2560),
+#                               y=np.linspace(-90, 90, 2560))
+
+try:
+    result = MODULE_FUNCTION(raster, **DEFAULT_ARGS)
+    graph = result.__dask_graph__()
+    task_count = len(graph)
+    tasks_per_chunk = task_count / 100.0  # (2560/256)**2 = 100 chunks
+
+    # Fan-in check: the maximum number of upstream tasks any single task
+    # depends on. High fan-in means the scheduler must hold many chunks
+    # in memory at once to produce one output.
+    from dask.core import get_dependencies
+    dsk = dict(graph)
+    max_fan_in = max(
+        (len(get_dependencies(dsk, key)) for key in dsk), default=0)
+
+    print(json.dumps({
+        "success": True,
+        "task_count": task_count,
+        "tasks_per_chunk": round(tasks_per_chunk, 2),
+        "max_fan_in": max_fan_in,
+        "extrapolation_30tb": "~{} tasks at 57M chunks".format(
+            int(tasks_per_chunk * 57_000_000))
+    }))
+except Exception as e:
+    print(json.dumps({"success": False, "error": str(e)}))
+```
+
+Adapt the function call and imports for the specific module. Run the script
+and capture its JSON output. If it errors, record the error and rely on
+Part A alone.
+
+**Verdict:** One of:
+- `SAFE` — memory bounded by chunk size, graph scales linearly
+- `RISKY` — bounded but tight (e.g. large overlap depth, 3D intermediates)
+- `WILL OOM` — forces full materialization or unbounded memory growth
+
+## 3. GPU Transfer Analysis
+
+Scan for CuPy/CUDA code paths. Flag:
+
+- HIGH: `.data.get()` followed by CuPy operations (GPU-CPU-GPU round-trip)
+- HIGH: `cupy.asarray()` inside a loop (repeated CPU-GPU transfers)
+- MEDIUM: Mixing NumPy and CuPy ops in same function without clear reason
+- MEDIUM: Register pressure — count float64 local variables in `@cuda.jit`
+  kernels; flag if >20
+- MEDIUM: Thread blocks >16x16 on kernels with >20 float64 locals
+
+If the module has NO GPU code path, note "no GPU backend" and skip.
+
+## 4. 
Memory Allocation Patterns + +- MEDIUM: Unnecessary `.copy()` on arrays never mutated downstream +- MEDIUM: Large temporary arrays that could be fused into the kernel +- LOW: `np.zeros_like()` + fill loop where `np.empty()` would suffice + +## 5. Numba Anti-Patterns + +- MEDIUM: Missing `@ngjit` on nested for-loops over `.data` arrays +- MEDIUM: `@jit` without `nopython=True` (object-mode fallback risk) +- LOW: Type instability — initializing with int then assigning float +- LOW: Column-major iteration on row-major arrays (inner loop should be last axis) + +## 6. Bottleneck Classification + +Based on your analysis, classify the module as ONE of: +- `IO-bound` — dominated by disk reads/writes or serialization +- `memory-bound` — peak allocation is the limiting factor +- `compute-bound` — CPU/GPU time dominates, memory is fine +- `graph-bound` — dask task graph overhead dominates + +## Output Format + +Return EXACTLY this JSON structure (no extra text before or after): + +```json +{ + "module": "MODULE_NAME", + "files_read": ["list of files you read"], + "findings": [ + { + "severity": "HIGH|MEDIUM|LOW", + "category": "dask_materialization|dask_chunking|gpu_transfer|register_pressure|memory_allocation|numba_antipattern", + "file": "filename.py", + "line": 123, + "description": "what the issue is", + "fix": "how to fix it", + "backends_affected": ["dask+numpy", "dask+cupy", "cupy", "numpy"] + } + ], + "oom_verdict": { + "dask_numpy": "SAFE|RISKY|WILL OOM", + "dask_cupy": "SAFE|RISKY|WILL OOM", + "reasoning": "one-sentence explanation", + "estimated_peak_per_chunk_mb": 0.5, + "task_count": 3721, + "tasks_per_chunk": 37.21, + "graph_simulation_ran": true + }, + "bottleneck": "compute-bound|memory-bound|IO-bound|graph-bound", + "bottleneck_reasoning": "one-sentence explanation" +} +``` + +IMPORTANT: Only flag patterns that are ACTUALLY present in the code. Do not +report hypothetical issues. False positives are worse than missed issues. 
+If a pattern like `.values` is used on a known-numpy-only code path, do not +flag it. +``` + +Wait for all subagents to return before proceeding to Step 4. +```` + +- [ ] **Step 2: Verify the subagent prompt is well-formed** + +Run: `grep -c "## [0-9]" .claude/commands/sweep-performance.md` +Expected: At least 6 (the six analysis sections inside the subagent prompt) + +- [ ] **Step 3: Commit** + +```bash +git add .claude/commands/sweep-performance.md +git commit -m "Add Phase 1 subagent dispatch and analysis template" +``` + +--- + +### Task 4: Phase 1 Report Merging and State Update (Steps 4-5) + +**Files:** +- Modify: `.claude/commands/sweep-performance.md` + +- [ ] **Step 1: Append Step 4 (merge subagent results into report)** + +Append the following to `.claude/commands/sweep-performance.md`: + +````markdown +## Step 4 -- Merge results and print the triage report + +Parse the JSON returned by each subagent. If a subagent returned malformed +output, record the module as "audit failed" with a note. + +### 4a. Print the Module Risk Ranking Table + +Sort modules by score descending. Print: + +``` +## Performance Sweep — Static Triage Report + +### Module Risk Ranking +| Rank | Module | Score | OOM Verdict | Bottleneck | HIGH | MED | LOW | +|------|-----------------|--------|-----------------|---------------|------|-----|-----| +| 1 | geotiff | 31200 | WILL OOM (d+np) | IO-bound | 3 | 1 | 0 | +| 2 | viewshed | 30050 | RISKY (d+np) | memory-bound | 2 | 2 | 1 | +| ... | ... | ... | ... | ... | ... | ... | ... | +``` + +If `--high-only` is set, only count HIGH findings and omit modules with zero HIGH. + +### 4b. 
Print the 30TB / 16GB Verdict Summary + +Group modules by OOM verdict: + +``` +### 30TB on Disk / 16GB RAM — Out-of-Memory Analysis + +#### WILL OOM (fix required) +- **module_name**: reasoning from subagent + +#### RISKY (bounded but tight) +- **module_name**: reasoning from subagent + +#### SAFE (memory bounded by chunk size) +- module_name, module_name, module_name, ... +``` + +### 4c. Print Detailed Findings + +For each module that has findings, print a severity-grouped table: + +``` +### module_name (bottleneck: compute-bound, OOM: SAFE) + +| # | Severity | File:Line | Category | Description | Fix | +|---|----------|----------------|-------------------------|------------------------------|-------------------------------| +| 1 | HIGH | slope.py:142 | dask_materialization | .values on dask input | Use .data or stay lazy | +| 2 | MEDIUM | slope.py:88 | dask_chunking | map_overlap depth too large | Reduce depth or warn users | +``` + +### 4d. Print Actionable Rockout Commands + +For each HIGH-severity finding, print a ready-to-paste `/rockout` command: + +``` +### Ready-to-Run Fixes (HIGH severity only) + +1. **geotiff** — eager .values materialization (WILL OOM) + /rockout "Fix eager .values materialization in geotiff reader. + The dask read path at reader.py:87 calls .values which forces + the full array into memory. For 30TB inputs this will OOM on + a 16GB machine. Must stay lazy through the entire read path." + +2. **cost_distance** — iterative solver unbounded memory (WILL OOM) + /rockout "Fix cost_distance iterative solver to work within + bounded memory. Currently materializes the full distance matrix + each iteration. Must use chunked iteration for 30TB dask inputs." +``` + +Construct each `/rockout` command from the finding's description and fix fields. +Include the OOM verdict and bottleneck classification in the prompt text so +rockout has full context. 
+
+````
+
+- [ ] **Step 2: Append Step 5 (state file update)**
+
+Append the following:
+
+````markdown
+## Step 5 -- Update state file
+
+Write `.claude/performance-sweep-state.json` with the triage results:
+
+```json
+{
+  "last_triage": "<ISO timestamp>",
+  "modules": {
+    "<module_name>": {
+      "last_inspected": "<ISO timestamp>",
+      "oom_verdict": "<SAFE|RISKY|WILL OOM>",
+      "bottleneck": "<classification>",
+      "high_count": <integer>,
+      "issue": null
+    }
+  }
+}
+```
+
+If the file already exists, merge — update entries for modules that were
+just audited, keep entries for modules not in this run's scope.
+
+If `--report-only` is set, stop here. Do not proceed to Step 6.
+````
+
+- [ ] **Step 3: Verify both steps appended**
+
+Run: `grep -c "^## Step" .claude/commands/sweep-performance.md`
+Expected: `6` (Steps 0 through 5)
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add .claude/commands/sweep-performance.md
+git commit -m "Add Phase 1 report merging and state update"
+```
+
+---
+
+### Task 5: Phase 2 Ralph-Loop Generation (Step 6)
+
+**Files:**
+- Modify: `.claude/commands/sweep-performance.md`
+
+- [ ] **Step 1: Append Step 6 (ralph-loop generation)**
+
+Append the following to `.claude/commands/sweep-performance.md`:
+
+````markdown
+## Step 6 -- Generate the ralph-loop command
+
+Collect all modules from Step 4 (or from the state file if `--skip-phase1`)
+that have at least one HIGH-severity finding and no `issue` recorded in the
+state file (i.e. not yet fixed).
+
+Sort them by: WILL OOM first, then RISKY, then by HIGH count descending.
+
+Determine the benchmark array size from arguments:
+- `--size small` → 128x128
+- `--size large` → 2048x2048
+- default → 512x512
+
+### 6a.
Print the ranked target list
+
+```
+### Phase 2 Targets (HIGH severity, unfixed)
+| # | Module        | HIGH Count | OOM Verdict | Bottleneck   |
+|---|---------------|------------|-------------|--------------|
+| 1 | geotiff       | 3          | WILL OOM    | IO-bound     |
+| 2 | cost_distance | 1          | WILL OOM    | memory-bound |
+| 3 | viewshed      | 2          | RISKY       | memory-bound |
+```
+
+If no modules qualify, print:
+"No HIGH-severity findings to fix. Run `/sweep-performance` without
+`--skip-phase1` to refresh the triage."
+Then stop.
+
+### 6b. Print the ralph-loop command
+
+Using the target list, generate and print:
+
+````
+/ralph-loop "Performance sweep Phase 2: benchmark and fix HIGH-severity findings.
+
+**Target modules in priority order:**
+1. <module> (<N> HIGH findings, <OOM verdict>) -- <one-line finding summary>
+2. ...
+...
+
+**For each module, in order:**
+
+1. Write a benchmark script at /tmp/perf_sweep_bench_<module>.py that:
+   - Imports the module's public functions
+   - Creates a test array (<size>x<size>, float64)
+   - For EACH available backend (numpy, dask+numpy; cupy and dask+cupy only if available):
+     a. Wrap the array in the appropriate DataArray type
+     b. Measure wall time: timeit.repeat(number=1, repeat=3), take median
+     c. Measure Python memory: tracemalloc.start() / tracemalloc.get_traced_memory()[1] for peak
+     d. Measure process memory: resource.getrusage(RUSAGE_SELF).ru_maxrss before and after
+     e. For CuPy backends: cupy.get_default_memory_pool().used_bytes() before and after
+   - Print results as JSON to stdout
+
+2. Run the benchmark script and capture results.
+
+3. Confirm the HIGH finding from Phase 1:
+   - If the dask backend uses significantly more memory than expected for
+     the chunk size, or wall time shows a materialization stall: CONFIRMED.
+   - If the benchmark shows no anomaly: downgrade to MEDIUM in state file,
+     print 'False positive — skipping' and move to the next module.
+
+4. If confirmed: run /rockout to fix the issue end-to-end (issue, worktree,
+   implementation, tests, docs).
Include the benchmark numbers in the
+   issue body for context.
+
+5. After rockout completes: rerun the same benchmark script. Print a
+   before/after comparison:
+   | Backend    | Metric      | Before | After  | Ratio | Verdict    |
+   |------------|-------------|--------|--------|-------|------------|
+   | numpy      | wall_ms     | 45.2   | 12.1   | 0.27x | IMPROVED   |
+   | dask+numpy | peak_rss_mb | 892    | 34     | 0.04x | IMPROVED   |
+   Thresholds: IMPROVED < 0.8x, REGRESSION > 1.2x, else UNCHANGED.
+
+6. Update .claude/performance-sweep-state.json with the issue number.
+
+7. Output ITERATION DONE
+
+If all targets have been addressed or confirmed as false positives:
+ALL PERFORMANCE ISSUES FIXED." --max-iterations <N> --completion-promise "ALL PERFORMANCE ISSUES FIXED"
+````
+
+Set `--max-iterations` to the number of target modules plus 2 (buffer for
+retries).
+
+### 6c. Print reminder text
+
+```
+Phase 1 triage complete. To proceed with fixes:
+  Copy the ralph-loop command above and paste it.
+
+Other options:
+  Fix one manually: copy any /rockout command from the report above
+  Rerun triage only: /sweep-performance --report-only
+  Skip Phase 1: /sweep-performance --skip-phase1 (reuses last triage)
+  Reset all tracking: /sweep-performance --reset-state
+```
+````
+
+- [ ] **Step 2: Verify the full command file structure**
+
+Run: `grep "^## Step" .claude/commands/sweep-performance.md`
+Expected output:
+```
+## Step 0 -- Determine mode and parse arguments
+## Step 1 -- Discover modules in scope
+## Step 2 -- Gather metadata and score each module
+## Step 3 -- Dispatch parallel subagents for static triage
+## Step 4 -- Merge results and print the triage report
+## Step 5 -- Update state file
+## Step 6 -- Generate the ralph-loop command
+```
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add .claude/commands/sweep-performance.md
+git commit -m "Add Phase 2 ralph-loop generation to sweep-performance"
+```
+
+---
+
+### Task 6: General Rules and Final Polish
+
+**Files:**
+- Modify:
`.claude/commands/sweep-performance.md` + +- [ ] **Step 1: Append the General Rules section** + +Append the following to `.claude/commands/sweep-performance.md`: + +```markdown +--- + +## General Rules + +- Phase 1 subagents do NOT modify any source, test, or benchmark files. + Read-only analysis only. +- Phase 2 ralph-loop modifies code only through `/rockout`. +- Temporary benchmark scripts and graph simulation scripts go in `/tmp/` + with unique names including the module name (e.g. `/tmp/perf_sweep_bench_slope.py`, + `/tmp/perf_sweep_graph_slope.py`). Clean them up after capturing results. +- Only flag patterns that are ACTUALLY present in the code. Do not report + hypothetical issues or patterns that "could" occur. +- Include the exact file path and line number for every finding so the user + can navigate directly to the issue. +- False positives are worse than missed issues. If you are not confident a + pattern is actually harmful in context (e.g. `.values` used intentionally + on a known-numpy array), do not flag it. +- The 30TB simulation constructs the dask task graph only; it NEVER calls + `.compute()`. +- State file (`.claude/performance-sweep-state.json`) is gitignored by + convention — do not add it to git. +- If $ARGUMENTS is empty, use defaults: audit all modules, benchmark at + 512x512, generate ralph-loop for HIGH items. +- For subpackage modules (geotiff, reproject), the subagent should read ALL + `.py` files in the subpackage directory, not just `__init__.py`. +- When generating `/rockout` commands, include the OOM verdict, bottleneck + classification, and affected backends in the prompt text so rockout has + full performance context. +``` + +- [ ] **Step 2: Read the full file end-to-end and verify structure** + +Run: `wc -l .claude/commands/sweep-performance.md` +Expected: Roughly 300-400 lines. + +Run: `grep "^## " .claude/commands/sweep-performance.md` +Expected: Step 0 through Step 6, plus "General Rules". 
+ +- [ ] **Step 3: Commit** + +```bash +git add .claude/commands/sweep-performance.md +git commit -m "Add general rules and finalize sweep-performance command" +``` + +--- + +### Task 7: Smoke Test the Command + +**Files:** +- Read: `.claude/commands/sweep-performance.md` + +- [ ] **Step 1: Verify the command appears in the slash command list** + +Run: `ls .claude/commands/sweep-performance.md` +Expected: File exists. + +- [ ] **Step 2: Verify all cross-references are consistent** + +Check that: +- The state file path `.claude/performance-sweep-state.json` is spelled + the same everywhere in the file. +- The subagent JSON output schema field names match the fields referenced + in Step 4 (report merging). +- The `/rockout` and `/ralph-loop` command syntax matches the patterns used + in the existing `accuracy-sweep.md` and `rockout.md` commands. + +Run: `grep -c "performance-sweep-state.json" .claude/commands/sweep-performance.md` +Expected: At least 4 occurrences (Steps 2, 5, 6, and General Rules). + +Run: `grep -c "/rockout" .claude/commands/sweep-performance.md` +Expected: At least 3 occurrences (Steps 4d, 6b, General Rules). + +Run: `grep -c "/ralph-loop" .claude/commands/sweep-performance.md` +Expected: At least 2 occurrences (Step 6b, Step 6c). + +- [ ] **Step 3: Verify subagent output schema fields match report consumption** + +These field names must appear in both the subagent prompt (Step 3) and the +report merging logic (Step 4): +- `module` +- `findings` (with `severity`, `category`, `file`, `line`, `description`, `fix`, `backends_affected`) +- `oom_verdict` (with `dask_numpy`, `dask_cupy`, `reasoning`) +- `bottleneck` + +Run: `grep -c '"oom_verdict"' .claude/commands/sweep-performance.md` +Expected: At least 2 (schema definition + state file). + +Run: `grep -c '"bottleneck"' .claude/commands/sweep-performance.md` +Expected: At least 2. 
+ +- [ ] **Step 4: Final commit with all checks passing** + +```bash +git add .claude/commands/sweep-performance.md +git commit -m "Verify sweep-performance command integrity" +``` + +Only commit if there were fixes needed. If all checks passed with no +changes, skip this commit. diff --git a/docs/superpowers/specs/2026-03-31-sweep-performance-design.md b/docs/superpowers/specs/2026-03-31-sweep-performance-design.md new file mode 100644 index 00000000..d544e0af --- /dev/null +++ b/docs/superpowers/specs/2026-03-31-sweep-performance-design.md @@ -0,0 +1,368 @@ +# Sweep-Performance: Parallel Performance Triage and Fix Workflow + +**Date:** 2026-03-31 +**Status:** Draft + +## Overview + +A `/sweep-performance` slash command that audits every xrspatial module for +performance bottlenecks, OOM risk under large-scale dask workloads, and +backend-specific anti-patterns. Uses parallel subagents for fast static triage, +then a sequential ralph-loop to benchmark and fix confirmed HIGH-severity +issues. + +The central question for every dask backend: "If the data on disk was 30TB +and the machine only had 16GB of RAM, would this tool cause an out-of-memory +error?" + +## Scope + +All `.py` modules under `xrspatial/` plus the `geotiff/` and `reproject/` +subpackages. Excludes `__init__.py`, `_version.py`, `__main__.py`, `utils.py`, +`accessor.py`, `preview.py`, `dataset_support.py`, `diagnostics.py`, +`analytics.py`. 
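The scope rule above can be sketched as follows; `discover_modules` and its return shape are illustrative names, not part of the command itself:

```python
from pathlib import Path

# Modules excluded from the audit, per the Scope section above.
EXCLUDED = {
    "__init__.py", "_version.py", "__main__.py", "utils.py",
    "accessor.py", "preview.py", "dataset_support.py",
    "diagnostics.py", "analytics.py",
}
SUBPACKAGES = ("geotiff", "reproject")


def discover_modules(pkg_root):
    """Map each audit unit name to the list of files it covers."""
    root = Path(pkg_root)
    units = {}
    # Single-file modules: every top-level .py not in the exclusion list.
    for py in sorted(root.glob("*.py")):
        if py.name not in EXCLUDED:
            units[py.stem] = [py]
    # Subpackage modules: one audit unit per directory, all .py files
    # inside it except __init__.py.
    for sub in SUBPACKAGES:
        subdir = root / sub
        if subdir.is_dir():
            units[sub] = sorted(
                p for p in subdir.rglob("*.py") if p.name != "__init__.py"
            )
    return units
```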
+
+## Architecture
+
+Two phases in a single invocation:
+
+```
+/sweep-performance
+  |
+  +-- Phase 1: Parallel Static Triage
+  |     |-- Score & rank modules (git metadata + complexity heuristics)
+  |     |-- Dispatch one subagent per module
+  |     |     |-- Static analysis (dask, GPU, memory, Numba patterns)
+  |     |     |-- 30TB/16GB OOM simulation (task graph construction, no compute)
+  |     |     +-- Return structured JSON findings
+  |     |-- Merge results into ranked report
+  |     +-- Update state file
+  |
+  +-- Phase 2: Ralph-Loop (HIGH severity only)
+        |-- Generate /ralph-loop command targeting HIGH modules
+        |-- Each iteration:
+        |     |-- Real benchmarks (wall time, tracemalloc, RSS, CuPy pool)
+        |     |-- Confirm finding is not false positive
+        |     |-- /rockout to fix
+        |     |-- Post-fix benchmark comparison
+        |     +-- Update state file
+        +-- User pastes command to start
+```
+
+---
+
+## Phase 1: Module Scoring
+
+For every module in scope, collect via git:
+
+| Field                | Source                                            |
+|----------------------|---------------------------------------------------|
+| `last_modified`      | `git log -1 --format=%aI -- <path>`               |
+| `total_commits`      | `git log --oneline -- <path> \| wc -l`            |
+| `loc`                | `wc -l < <file>`                                  |
+| `has_dask_backend`   | grep for `_run_dask`, `map_overlap`, `map_blocks` |
+| `has_cuda_backend`   | grep for `@cuda.jit`, `import cupy`               |
+| `is_io_module`       | module is in geotiff/ or reproject/               |
+| `has_existing_bench` | matching file exists in `benchmarks/benchmarks/`  |
+
+### Scoring Formula
+
+```
+days_since_inspected = (today - last_perf_inspected).days   # 9999 if never
+days_since_modified = (today - last_modified).days
+
+score = (days_since_inspected * 3)
+      + (loc * 0.1)
+      + (total_commits * 0.5)
+      + (has_dask_backend * 200)
+      + (has_cuda_backend * 150)
+      + (is_io_module * 300)
+      - (days_since_modified * 0.2)
+      - (has_existing_bench * 100)
+
+Rationale:
+- Never-inspected modules dominate (9999 * 3 = ~30,000).
+- Dask and CUDA backends boosted: that is where OOM and perf bugs live.
+- I/O modules get the highest boost: most relevant for 30TB question. +- Larger modules more likely to contain issues. +- Existing ASV benchmarks slightly deprioritize (perf already considered). + +--- + +## Phase 1: Subagent Static Analysis + +One subagent per module. Each performs the checks below and returns a +structured JSON blob. + +### Dask Path Analysis + +- `.values` on dask-backed DataArray (premature materialization) — **HIGH** +- `.compute()` inside a loop — **HIGH** +- `np.array()` / `np.asarray()` wrapping dask or CuPy array — **HIGH** +- `da.stack()` without `.rechunk()` — **MEDIUM** +- `map_overlap` with depth >= chunk_size / 4 — **MEDIUM** +- Missing `boundary` argument in `map_overlap` — **MEDIUM** +- Redundant computation (same function called twice on same input) — **MEDIUM** +- Python loops over dask chunks (serializes the graph) — **MEDIUM** + +### 30TB / 16GB OOM Verdict + +Two-part analysis for each dask code path: + +**Part 1 — Static trace.** Follow the dask code path and answer: does peak +memory scale with total array size, or with chunk size? If any step forces +full materialization, verdict is WILL OOM. + +**Part 2 — Task graph simulation.** Write and execute a script that: + +```python +import dask.array as da +import xarray as xr + +# Use a representative grid (2560x2560, 10x10 = 100 chunks) to inspect +# graph structure. The pattern is identical at any scale — what matters +# is whether the graph fans out, materializes, or stays chunk-local. 
+arr = da.zeros((2560, 2560), chunks=(256, 256), dtype='float64') +raster = xr.DataArray(arr, dims=['y', 'x']) + +# Call the function lazily +result = module_function(raster, **default_args) + +# Inspect the graph without executing +graph = result.__dask_graph__() +task_count = len(graph) +tasks_per_chunk = task_count / 100 # normalize to per-chunk + +# Check for fan-out patterns or full-materialization nodes +# Extrapolate to 30TB: ~57 million chunks at 256x256 float64 +# If tasks_per_chunk is constant => graph scales linearly => SAFE +# If any node depends on all chunks => full materialization => WILL OOM +``` + +The script constructs the graph only, never calls `.compute()`. Reports: +- Task count and tasks-per-chunk ratio +- Estimated peak memory per chunk (MB) +- Whether the graph contains fan-out or materialization nodes +- Extrapolation to 30TB: linear graph growth (SAFE) vs fan-out (WILL OOM) + +**Verdict**: `SAFE`, `RISKY` (bounded but tight), or `WILL OOM` (unbounded +or materializes). 
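A self-contained sketch of the simulation, using dask's `map_overlap` on an identity kernel as a stand-in for a real xrspatial tool (the `tasks_per_chunk` helper and the factor-of-constancy check are illustrative assumptions, not the exact script the subagent writes):

```python
import dask.array as da


def identity_filter(block):
    # Stand-in chunk-local kernel: the graph structure, not the math,
    # is what the simulation inspects.
    return block


def tasks_per_chunk(grid_size, chunk=256):
    arr = da.zeros((grid_size, grid_size), chunks=(chunk, chunk), dtype="float64")
    result = da.map_overlap(identity_filter, arr, depth=1, boundary="reflect")
    graph = result.__dask_graph__()  # build only; never call .compute()
    return len(graph) / arr.npartitions


# A chunk-local operation keeps this ratio roughly constant as the grid
# (and thus the chunk count) grows: linear graph growth => SAFE verdict.
small = tasks_per_chunk(1024)  # 4x4 = 16 chunks
large = tasks_per_chunk(2048)  # 8x8 = 64 chunks
```

If the ratio instead grows with the chunk count, some node depends on many or all chunks (a fan-out or materialization pattern), which is the WILL OOM signature.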
+ +### GPU Transfer Analysis + +- `.data.get()` followed by CuPy ops (GPU-CPU-GPU round-trip) — **HIGH** +- `cupy.asarray()` inside a hot loop — **HIGH** +- Mixing NumPy/CuPy ops without reason — **MEDIUM** +- Register pressure: >20 float64 locals in `@cuda.jit` kernel — **MEDIUM** +- Thread blocks >16x16 on register-heavy kernels — **MEDIUM** + +### Memory Allocation Patterns + +- Unnecessary `.copy()` on arrays never mutated — **MEDIUM** +- `np.zeros_like()` + fill loop (could be `np.empty()`) — **LOW** +- Large temporary arrays that could be fused into the kernel — **MEDIUM** + +### Numba Anti-Patterns + +- Missing `@ngjit` on nested for-loops over `.data` arrays — **MEDIUM** +- `@jit` without `nopython=True` (object-mode fallback risk) — **MEDIUM** +- Type instability (int/float mixing in Numba functions) — **LOW** +- Column-major iteration on row-major arrays (cache-unfriendly) — **LOW** + +### Bottleneck Classification + +Based on static analysis, classify the module as one of: +- **IO-bound** — dominated by disk reads/writes or serialization +- **Memory-bound** — peak allocation is the limiting factor +- **Compute-bound** — CPU/GPU time dominates, memory is fine +- **Graph-bound** — dask task graph overhead dominates (too many small tasks) + +### Subagent Output Schema + +```json +{ + "module": "slope", + "files_read": ["xrspatial/slope.py"], + "findings": [ + { + "severity": "HIGH", + "category": "dask_materialization", + "file": "slope.py", + "line": 142, + "description": ".values on dask input in _run_dask", + "fix": "Use .data.compute() or restructure to stay lazy", + "backends_affected": ["dask+numpy", "dask+cupy"] + } + ], + "oom_verdict": { + "dask_numpy": "SAFE", + "dask_cupy": "SAFE", + "reasoning": "map_overlap with depth=1, memory bounded by chunk size", + "estimated_peak_per_chunk_mb": 0.5, + "task_count": 3721, + "graph_simulation_ran": true + }, + "bottleneck": "compute-bound", + "bottleneck_reasoning": "3x3 kernel with Numba JIT, no I/O, small 
overlap" +} +``` + +--- + +## Phase 1: Merged Report + +After all subagents return, print a consolidated report. + +### Module Risk Ranking Table + +``` +| Rank | Module | Score | OOM Verdict | Bottleneck | HIGH | MED | LOW | +|------|---------------|-------|-----------------|--------------|------|-----|-----| +| 1 | geotiff | 31200 | WILL OOM (d+np) | IO-bound | 3 | 1 | 0 | +| 2 | viewshed | 30050 | RISKY (d+np) | memory-bound | 2 | 2 | 1 | +| ... | ... | ... | ... | ... | ... | ... | ... | +``` + +### 30TB / 16GB Verdict Summary + +Grouped by verdict: + +- **WILL OOM (fix required):** list modules with reasoning +- **RISKY (bounded but tight):** list modules with reasoning +- **SAFE (memory bounded by chunk size):** list modules + +### Detailed Findings + +Per-module table of all findings grouped by severity (file:line, pattern, +description, fix). + +### Actionable Rockout Commands + +For each HIGH-severity finding, a ready-to-paste `/rockout` command. + +### State File Update + +Write `.claude/performance-sweep-state.json`: + +```json +{ + "last_triage": "2026-03-31T14:00:00Z", + "modules": { + "slope": { + "last_inspected": "2026-03-31T14:00:00Z", + "oom_verdict": "SAFE", + "bottleneck": "compute-bound", + "high_count": 0, + "issue": null + } + } +} +``` + +--- + +## Phase 2: Ralph-Loop for HIGH Severity Fixes + +Collect all modules with at least one HIGH-severity finding. Generate a +`/ralph-loop` command targeting them in priority order. + +### Each Iteration + +1. **Benchmark** the module on a moderate array (512x512 default) across all + available backends. Measure four metrics per backend per function: + - Wall time: `timeit.repeat(number=1, repeat=3)`, median + - Python memory: `tracemalloc.get_traced_memory()` peak + - Process memory: `resource.getrusage(RUSAGE_SELF).ru_maxrss` delta + - GPU memory (if CuPy): `cupy.get_default_memory_pool().used_bytes()` delta + +2. **Confirm the static finding** from Phase 1 is real. 
If the benchmark + shows the issue does not manifest (false positive), downgrade to MEDIUM + in the report and skip to next module. + +3. **Classify the bottleneck** with measured data: + - IO-bound: wall time dominated by read/write, low CPU + - Memory-bound: peak RSS much larger than expected for chunk size + - Compute-bound: CPU pegged, memory stable + - Graph-bound: dask task count extremely high, scheduler overhead visible + +4. **Run `/rockout`** to fix the confirmed issue (GitHub issue, worktree, + implementation, tests, docs). + +5. **Post-fix benchmark** — rerun the same benchmark. Report before/after + delta. + +6. **Update state** — record the fix in + `.claude/performance-sweep-state.json` with issue number. + +7. Output `ITERATION DONE`. + +### Generated Command Shape + +``` +/ralph-loop "Performance sweep Phase 2: benchmark and fix HIGH-severity findings. + +**Target modules in priority order:** +1. geotiff (3 HIGH findings, WILL OOM) -- eager .values materialization +2. cost_distance (1 HIGH finding, WILL OOM) -- iterative solver unbounded memory + +**For each module:** +1. Write and run a benchmark script measuring wall time, peak memory + (tracemalloc + RSS + CuPy pool) across all available backends +2. Confirm the HIGH finding from Phase 1 triage is real +3. If confirmed: run /rockout to fix it end-to-end +4. After rockout: rerun benchmark, report before/after delta +5. Update .claude/performance-sweep-state.json +6. Output ITERATION DONE + +If all targets addressed: ALL PERFORMANCE ISSUES FIXED." +--max-iterations {N+2} --completion-promise "ALL PERFORMANCE ISSUES FIXED" +``` + +### Reminder Text + +``` +Phase 1 triage complete. To proceed with fixes: + Copy the ralph-loop command above and paste it. 
+ +Other options: + Fix one manually: copy any /rockout command from the report above + Rerun triage only: /sweep-performance --report-only + Skip Phase 1: /sweep-performance --skip-phase1 (reuses last triage) + Reset all tracking: /sweep-performance --reset-state +``` + +--- + +## Arguments + +| Argument | Effect | +|--------------------|------------------------------------------------------------| +| `--top N` | Limit Phase 1 subagents to top N scored modules (default: all) | +| `--exclude m1,m2` | Remove named modules from scope | +| `--only-terrain` | slope, aspect, curvature, terrain, terrain_metrics, hillshade, sky_view_factor | +| `--only-focal` | focal, convolution, morphology, bilateral, edge_detection, glcm | +| `--only-hydro` | flood, cost_distance, geodesic, surface_distance, viewshed, erosion, diffusion | +| `--only-io` | geotiff, reproject, rasterize, polygonize | +| `--reset-state` | Delete state file and start fresh | +| `--skip-phase1` | Reuse last triage state, go straight to ralph-loop generation | +| `--report-only` | Run Phase 1 only, no ralph-loop command | +| `--size small` | Benchmark at 128x128 | +| `--size large` | Benchmark at 2048x2048 | +| `--high-only` | Only report HIGH severity findings | + +Default (no arguments): audit all modules, benchmark at 512x512, generate +ralph-loop for HIGH items. + +--- + +## General Rules + +- Phase 1 subagents do NOT modify source files. Read-only analysis. +- Phase 2 ralph-loop modifies code only through `/rockout`. +- Temporary benchmark scripts go in `/tmp/` with unique names. +- Only flag patterns actually present in the code; no hypothetical issues. +- Include exact file path and line number for every finding. +- False positives are worse than missed issues. +- The 30TB simulation constructs the dask graph only; it never calls `.compute()`. +- State file (`.claude/performance-sweep-state.json`) is gitignored by convention. 
diff --git a/xrspatial/surface_distance.py b/xrspatial/surface_distance.py index 5ff8d03b..572181cc 100644 --- a/xrspatial/surface_distance.py +++ b/xrspatial/surface_distance.py @@ -305,6 +305,20 @@ def _precompute_dd_grid(lat_2d, lon_2d, dy, dx): """ H, W = lat_2d.shape n = len(dy) + # Memory guard: dd_grid is (n_neighbors, H, W) float64 + estimated = n * H * W * 8 + try: + from xrspatial.zonal import _available_memory_bytes + avail = _available_memory_bytes() + except ImportError: + avail = 2 * 1024**3 + if estimated > 0.8 * avail: + raise MemoryError( + f"Geodesic dd_grid needs ~{estimated / 1e9:.1f} GB " + f"({n} neighbors x {H}x{W} x 8 bytes) but only " + f"~{avail / 1e9:.1f} GB available. Use planar mode " + f"or downsample the raster." + ) dd_grid = np.zeros((n, H, W), dtype=np.float64) for i in range(n):