From 340520581453c33cce4c76f51ee00220d9ee26e9 Mon Sep 17 00:00:00 2001
From: Allison Piper <alliepiper16@gmail.com>
Date: Tue, 12 May 2026 15:50:03 -0400
Subject: [PATCH 1/7] Add repo-local agent infrastructure: skills, agents, and
 bootstrap

Establishes a single source-of-truth bootstrap (AGENTS.md) and a
catalogue of 17 skills + 4 agents under `.agent/{skills,agents}/`
that route by user intent. Both Claude Code and Codex resolve the
same files via the `.claude/{skills,agents}` symlinks.

Skills:
- cccl, cccl-agent-impl  - orientation + concept primer
- cccl-clarify           - decision-point escalation
- cccl-commit            - interactive commit prep
- cccl-pr                - PR lifecycle (open / edit / comment / push + CI)
- cccl-resplit-branch    - rebase + resplit commit history
- cccl-triage-pr         - diagnose CI failures on a PR
- cccl-triage-nightly    - diagnose CI failures in the latest nightly
- cccl-ci, cccl-ci-benchmarks, cccl-bisect, cccl-devcontainers,
  cccl-build-and-test-targets, cccl-cpp-builds, cccl-python,
  cccl-sass-diff, cccl-libcudacxx-style - CI / build / test references

Agents (non-interactive; haiku, except cccl-ci-overrides on sonnet):
- cccl-ok-to-test         - SHA-verified `/ok to test` poster
- cccl-fetch-ci-failures  - paginated job-failure TSV
- cccl-summarize-job-log  - 5-10 line log digest
- cccl-ci-overrides       - matrix-override YAML + skip-tag generation

Bootstrap:
- AGENTS.md - minimal routing README pointing at the `cccl` skill
- CLAUDE.md - symlink to AGENTS.md
- .claude/settings.json - read-only allow-list (gh / git read forms,
  rg / grep / jq / sed -n, ls / cat / head / tail / wc / file / stat,
  mkdir -p /tmp/claude/*) plus SessionStart hook surfacing `cccl`.
  Mutating ops intentionally not allow-listed - they prompt every use.

Also renames `.agent/skills/libcudacxx-style/` to
`.agent/skills/cccl-libcudacxx-style/` to match the cccl-* prefix
convention across the rest of the catalogue.
---
 .agent/agents/cccl-ci-overrides.md            | 131 +++++
 .agent/agents/cccl-fetch-ci-failures.md       |  53 ++
 .agent/agents/cccl-ok-to-test.md              |  53 ++
 .agent/agents/cccl-summarize-job-log.md       |  59 ++
 .agent/skills/cccl-agent-impl/SKILL.md        |  54 ++
 .agent/skills/cccl-bisect/SKILL.md            |  73 +++
 .../cccl-build-and-test-targets/SKILL.md      |  73 +++
 .agent/skills/cccl-ci-benchmarks/SKILL.md     |  55 ++
 .agent/skills/cccl-ci/SKILL.md                |  54 ++
 .agent/skills/cccl-clarify/SKILL.md           |  43 ++
 .agent/skills/cccl-commit/SKILL.md            | 120 ++++
 .agent/skills/cccl-cpp-builds/SKILL.md        |  53 ++
 .agent/skills/cccl-devcontainers/SKILL.md     |  58 ++
 .../SKILL.md                                  |   2 +-
 .agent/skills/cccl-pr/SKILL.md                | 100 ++++
 .agent/skills/cccl-python/SKILL.md            |  46 ++
 .agent/skills/cccl-resplit-branch/SKILL.md    | 111 ++++
 .agent/skills/cccl-sass-diff/SKILL.md         |  32 ++
 .agent/skills/cccl-triage-nightly/SKILL.md    |  42 ++
 .agent/skills/cccl-triage-pr/SKILL.md         |  85 +++
 .agent/skills/cccl/SKILL.md                   |  39 ++
 .claude/agents                                |   1 +
 .claude/settings.json                         |  69 +++
 .claude/skills                                |   1 +
 .claude/skills/libcudacxx-style/SKILL.md      |   6 -
 AGENTS.md                                     | 515 +++---------------
 CLAUDE.md                                     |   7 +-
 27 files changed, 1481 insertions(+), 454 deletions(-)
 create mode 100644 .agent/agents/cccl-ci-overrides.md
 create mode 100644 .agent/agents/cccl-fetch-ci-failures.md
 create mode 100644 .agent/agents/cccl-ok-to-test.md
 create mode 100644 .agent/agents/cccl-summarize-job-log.md
 create mode 100644 .agent/skills/cccl-agent-impl/SKILL.md
 create mode 100644 .agent/skills/cccl-bisect/SKILL.md
 create mode 100644 .agent/skills/cccl-build-and-test-targets/SKILL.md
 create mode 100644 .agent/skills/cccl-ci-benchmarks/SKILL.md
 create mode 100644 .agent/skills/cccl-ci/SKILL.md
 create mode 100644 .agent/skills/cccl-clarify/SKILL.md
 create mode 100644 .agent/skills/cccl-commit/SKILL.md
 create mode 100644 .agent/skills/cccl-cpp-builds/SKILL.md
 create mode 100644 .agent/skills/cccl-devcontainers/SKILL.md
 rename .agent/skills/{libcudacxx-style => cccl-libcudacxx-style}/SKILL.md (99%)
 create mode 100644 .agent/skills/cccl-pr/SKILL.md
 create mode 100644 .agent/skills/cccl-python/SKILL.md
 create mode 100644 .agent/skills/cccl-resplit-branch/SKILL.md
 create mode 100644 .agent/skills/cccl-sass-diff/SKILL.md
 create mode 100644 .agent/skills/cccl-triage-nightly/SKILL.md
 create mode 100644 .agent/skills/cccl-triage-pr/SKILL.md
 create mode 100644 .agent/skills/cccl/SKILL.md
 create mode 120000 .claude/agents
 create mode 100644 .claude/settings.json
 create mode 120000 .claude/skills
 delete mode 100644 .claude/skills/libcudacxx-style/SKILL.md
 mode change 100644 => 120000 CLAUDE.md

diff --git a/.agent/agents/cccl-ci-overrides.md b/.agent/agents/cccl-ci-overrides.md
new file mode 100644
index 00000000000..7ab63386ee1
--- /dev/null
+++ b/.agent/agents/cccl-ci-overrides.md
@@ -0,0 +1,131 @@
+---
+name: cccl-ci-overrides
+description: "Use this agent when a caller skill wants to limit CCCL CI cost on a PR via `workflows.override` matrix entries and/or `[skip-*]` commit tags. Typical triggers include cccl-triage-pr building a targeted-repro override after diagnosing failures, cccl-triage-nightly building one with `for_workflow: nightly`, and commit-prep flows asking \"what override + skip tags fit this diff?\". Takes working changes (paths or diff range) and/or a list of failed-job names; returns override snippet + skip tags + per-decision rationale. Knows `ci/project_files_and_dependencies.yaml`, `ci/matrix.yaml`, and `ci-overview.md`. Non-interactive. See \"When to invoke\" in the agent body for worked scenarios."
+model: sonnet
+color: magenta
+tools: Bash, Read, Grep
+---
+
+# cccl-ci-overrides
+
+Advise on CI cost-limiting measures — override matrix entries and skip tags.
+
+## When to invoke
+
+- **Targeted repro from failed jobs.** Triage skill diagnosed failures and wants the minimum override matrix that
+  reproduces them on a subsequent CI run.
+- **Diff-driven override.** Commit-prep flow has a set of changed paths (or a diff range) and wants to know which
+  matrix entries are needed and which `[skip-*]` tags are safe.
+- **Combined input.** Both failed-job list and changed paths; the agent unions and de-dupes the entries.
+
+## Sources of truth
+
+- `ci/project_files_and_dependencies.yaml` — project definitions, `include_regexes`, `exclude_regexes`,
+  `exclude_project_files`, `lite_dependencies`, `full_dependencies`, global `ignore_regexes`. `core` is special:
+  any unmatched non-ignored file marks `core` dirty → full rebuild.
+- `ci/matrix.yaml` — `workflows.override` schema (see top-of-file examples). Workflow sections: `pull_request`,
+  `pull_request_lite`, `nightly`, `weekly`, `python-wheels`, `devcontainers`. Plus `exclude:` rules, `jobs:`
+  catalogue (job-key → `name:`), `projects:` catalogue, `tags:` defaults (notably
+  `project: { default: ['libcudacxx', 'cub', 'thrust'] }`).
+- `ci-overview.md` — canonical `[skip-*]` tokens.
+
+## Tool to lean on
+
+`ci/inspect_changes.py --refs <BASE> <HEAD>` (or `--file`, `--stdin`) already implements the dep-graph trace and
+honors `ignore_regexes` + `exclude_*` rules. Prefer it over re-implementing.
+
+## Inputs
+
+Any combination of:
+
+- `paths:` (newline-separated changed paths) OR `diff_range: <BASE>..<HEAD>` — drives override + skip-tag
+  analysis.
+- `failed_jobs:` (path to file with failed-job names, one per line) — drives direct-reproduction override.
+- `for_workflow:` — `pull_request` (default) | `pull_request_lite` | `nightly` | `weekly`.
+
+At least one of `paths`/`diff_range`/`failed_jobs` required.
+
+## Override matrix — from changes
+
+1. Run `ci/inspect_changes.py` to classify dirty projects.
+2. From `for_workflow`'s section, pull entries that name a dirty project (or that omit `project:` when the default
+   set intersects the dirty set).
+3. Subtract `exclude:` matches.
+4. Emit as override entries.
+
+## Override matrix — from failed jobs
+
+1. Parse each name: `[CTK<X> <COMPILER><VER> C++<STD>] <Project> <JobName>(<Arch>)`. Cross-reference `jobs:` in
+   matrix.yaml to map `<JobName>` (e.g. `BuildHostLaunch`, `TestNoLaunch`, `NVRTC`) → job key (e.g. `build_lid0`,
+   `test_nolid`, `nvrtc`).
+2. Build the minimum override entry per name — `{jobs: [<key>], project: <name>, std: <std>, ctk: <ctk>,
+   cxx: <cxx>, gpu: <gpu if test>}`.
+3. Merge entries sharing `(project, jobs)`; combine `std`/`ctk`/`cxx` into lists.
+
+## Combining inputs
+
+If caller provides both, union the entries. De-dupe.
+
+## Snippet format
+
+```yaml
+# Targeted repro of <source>. Reset before merging.
+- {jobs: ['build'], project: 'libcudacxx', std: 'all', ctk: ['12.0', '12.X'], cxx: ['gcc8', 'gcc9', 'gcc10']}
+- {jobs: ['build'], project: 'cub',        std: 17,    ctk: ['12.0', '12.X'], cxx: ['gcc8']}
+```
+
+`<source>` = nightly run ID / PR check context / `<diff_range>` / "manual triage".
+
+For targeted repro via `build_and_test_targets.sh`, prefer the `target` project pattern from matrix.yaml's
+top-of-file example:
+
+```yaml
+- { jobs: ['run_gpu'], project: 'target', ctk: ['13.X'], cxx: 'gcc', gpu: 'rtxa6000',
+    args: '--preset cub-cpp20 --build-targets "cub.cpp20.test.iterator" --ctest-targets "cub.cpp20.test.iterator"' }
+```
+
+If `workflows.override:` is already non-empty, emit as **additions** — caller decides whether to append or
+replace.
+
+## Skip tags (path-based)
+
+For each `[skip-*]` token in `ci-overview.md`, suggest the tag when no changed path matches the area it protects:
+
+| Tag              | Suggest when no changed path matches          |
+|------------------|-----------------------------------------------|
+| `[skip-docs]`    | `docs/`, `*.rst`                              |
+| `[skip-vdc]`     | `.devcontainer/`, `ci/`, `.github/workflows/` |
+| `[skip-tpt]`     | third-party canary triggers                   |
+| `[skip-rapids]`  | RAPIDS paths (subset of tpt)                  |
+| `[skip-matx]`    | MatX paths (subset of tpt)                    |
+| `[skip-pytorch]` | PyTorch paths (subset of tpt)                 |
+| `[skip-matrix]`  | no CCCL build/test code (rare — docs/CI-only) |
+
+Changes purely within `workflows.override:` target CI scope, not CI infra — don't withhold `[skip-vdc]` for them.
+Paths matching `ignore_regexes` already don't trigger CI — exclude in both directions.
+
+Note that the skip tags only apply to the last commit in a branch; save them until the end if making multiple
+commits.
+
+## Output
+
+````
+## Override matrix snippet (insert under `workflows.override:`)
+
+```yaml
+# <source>. Reset before merging.
+<entries>
+```
+
+## Skip tags
+
+`[skip-vdc][skip-docs][skip-tpt]`
+
+## Rationale
+
+- Override: <why these reproduce the targeted jobs>
+- Skip tags: <what each protects, what the diff doesn't touch>
+- Inputs: <inspect_changes.py summary, failed-job count>
+````
+
+Omit "Override matrix snippet" if no entries; omit "Skip tags" if no `paths`/`diff_range` given.
diff --git a/.agent/agents/cccl-fetch-ci-failures.md b/.agent/agents/cccl-fetch-ci-failures.md
new file mode 100644
index 00000000000..73fbf267f94
--- /dev/null
+++ b/.agent/agents/cccl-fetch-ci-failures.md
@@ -0,0 +1,53 @@
+---
+name: cccl-fetch-ci-failures
+description: "Use this agent when a caller skill needs the list of failed jobs from a CCCL CI run, given either a PR number or a workflow run ID. Typical triggers include cccl-triage-pr collecting failures for the current branch's PR, cccl-triage-nightly collecting failures for the latest scheduled nightly run, and any other skill that needs failed-job TSV output for downstream summarization or override-matrix generation. Output is a TSV at a caller-specified path with one row per failed job: `<job-id>\\t<full-name>\\t<grouping-hint>`. Handles `gh api --paginate` and the `jq -s` slurp gotcha. Non-interactive. See \"When to invoke\" in the agent body for worked scenarios."
+model: haiku
+color: cyan
+tools: Bash, Read
+---
+
+# cccl-fetch-ci-failures
+
+Return failed jobs from a CCCL CI run as TSV.
+
+## When to invoke
+
+- **Triage-PR fetch.** A PR-triage skill has the PR number and needs a TSV of failed jobs to pick representatives
+  for log-fetching. Caller hands over PR#, output path, scratch dir.
+- **Triage-nightly fetch.** A nightly-triage skill has the workflow run ID (resolved from
+  `gh run list --workflow=ci-workflow-nightly.yml`) and needs the same TSV. Caller hands over run ID, output path,
+  scratch dir.
+
+## Inputs
+
+One of:
+- `pr: <PR#>` — latest run on the PR.
+- `run: <RUN_ID>` — specific workflow run.
+
+Plus `output: <path>` and `scratch: <dir>`. Missing any → abort.
+
+## Steps
+
+1. **Resolve the run ID.** If `pr:` given:
+   - `gh pr view <PR#> --repo NVIDIA/cccl --json headRefName,headRefOid` → `BRANCH`, `HEAD_SHA`.
+   - `gh run list --repo NVIDIA/cccl --branch <BRANCH> --limit 5 --json databaseId,headSha,conclusion` → pick the
+     latest entry whose `headSha == HEAD_SHA`. No match → abort.
+   - `RUN_ID = databaseId` from that entry.
+
+   Avoid `gh pr view --json statusCheckRollup` — it returns 100k+ tokens on CCCL PRs.
+2. **Fetch jobs.** `gh api 'repos/NVIDIA/cccl/actions/runs/<RUN_ID>/jobs?per_page=100' --paginate` into
+   `<scratch>/jobs_raw.json`. `--paginate` concatenates objects; subsequent `jq` needs `-s`.
+3. **Extract failures.** `jq -s -r '[.[].jobs[] | select(.conclusion == "failure")] | .[] | [.id, .name] | @tsv'`
+   into `<scratch>/failed_jobs_raw.tsv`. Empty → return zero-failures.
+4. **Append grouping hints.** Per row, parse the name and append `<toolchain>|<project>|<variant>`:
+   - Toolchain: `[CTK<X> <COMPILER><VER> C++<STD>]` substring.
+   - Project: CUB / libcudacxx / Thrust / cudax / Python.
+   - Variant: Build / Test / HostLaunch / DeviceLaunch / TestNoLaunch / etc.
+
+   Example row:
+   ```
+   74849038365	[CTK13.2 GCC15 C++20] cudax TestNoLaunch(amd64)	CTK13.2 GCC15 C++20|cudax|TestNoLaunch
+   ```
+
+   Write to `<output>`.
+5. **Return summary** — count + tally of the third column.
diff --git a/.agent/agents/cccl-ok-to-test.md b/.agent/agents/cccl-ok-to-test.md
new file mode 100644
index 00000000000..be831cc13e0
--- /dev/null
+++ b/.agent/agents/cccl-ok-to-test.md
@@ -0,0 +1,53 @@
+---
+name: cccl-ok-to-test
+description: "Use this agent when a caller skill has pushed a commit to a CCCL PR's branch and wants to trigger CI by posting the copy-pr-bot `/ok to test <SHA>` comment. Typical triggers include cccl-triage-pr after a fix commit lands on an existing PR, cccl-triage-nightly after opening a new draft PR for a nightly fix, and any caller that needs the SHA-verification gate (local HEAD vs remote PR head) before posting. The agent verifies the local SHA matches the remote head, aborts on mismatch, posts the comment, and suggests the caller schedule a 20-minute polling loop. Non-interactive. Never pushes, never creates PRs, never force-pushes — the caller owns all of those decisions. See \"When to invoke\" in the agent body for worked scenarios."
+model: haiku
+color: yellow
+tools: Bash, Read
+---
+
+# cccl-ok-to-test
+
+Verify local-vs-remote SHA for a CCCL PR; post `/ok to test <SHA>`.
+
+## When to invoke
+
+- **PR-triage CI restart.** Caller has just pushed a fix commit to the existing PR's branch. Agent verifies local
+  HEAD matches remote head, posts `/ok to test <SHA>`, returns the SHA + a polling reminder.
+- **Nightly-triage first CI run.** Caller just created a draft PR for a nightly fix and needs the initial
+  `/ok to test`. Same flow.
+- **Mismatch gate.** Caller (or user) suspects local and remote may have diverged. Agent's first job is to
+  refuse-and-report on mismatch.
+
+## Inputs
+
+1. `<PR#>`
+2. `<OWNER/REPO>` (typically `NVIDIA/cccl`, always explicit)
+3. `<BRANCH>`
+
+Missing → abort naming the field.
+
+## Steps
+
+1. `git rev-parse HEAD` → `LOCAL_SHA`. The only SHA used in the comment; never derived elsewhere.
+2. `gh pr view <PR#> --repo <OWNER/REPO> --json headRefOid,isDraft,headRefName` → `REMOTE_SHA`, `isDraft`,
+   `headRefName`.
+3. `headRefName != <BRANCH>` → abort showing both.
+4. `LOCAL_SHA != REMOTE_SHA` → abort:
+   ```
+   ERROR: local HEAD does not match remote PR head.
+     local:   <LOCAL_SHA>
+     remote:  <REMOTE_SHA>
+   Likely: unpushed commits, or someone else pushed after you.
+   Aborting without posting `/ok to test`.
+   ```
+5. `gh pr comment <PR#> --repo <OWNER/REPO> --body "/ok to test <LOCAL_SHA>"`.
+6. Return:
+   ```
+   Posted `/ok to test <LOCAL_SHA>` on PR #<PR#>. Draft: <isDraft>.
+   Caller: consider `ScheduleWakeup(delaySeconds=1200)` polling on
+   `gh pr checks <PR#>`.
+   ```
+
+Local SHA is the contract — the caller just pushed it. Remote SHA is checked only as a sync gate against
+concurrent pushes.
diff --git a/.agent/agents/cccl-summarize-job-log.md b/.agent/agents/cccl-summarize-job-log.md
new file mode 100644
index 00000000000..77259d7632e
--- /dev/null
+++ b/.agent/agents/cccl-summarize-job-log.md
@@ -0,0 +1,59 @@
+---
+name: cccl-summarize-job-log
+description: "Use this agent when a caller skill has downloaded a single CCCL CI job log and needs a 5–10 line summary. Typical triggers include cccl-triage-pr or cccl-triage-nightly summarizing one representative log per failure cluster (dispatched in parallel — one agent per log), and any other workflow that wants to digest a job log without loading the full output into orchestrator context. Input is a path to a downloaded job log (typically `/tmp/claude/<sessionid>/job_<JID>.log`). Output covers first real error, failing command/step, stack trace, infra-vs-code classification, and anything CCCL-specific worth flagging. Non-interactive. See \"When to invoke\" in the agent body for worked scenarios."
+model: haiku
+color: cyan
+tools: Bash, Read, Grep
+---
+
+# cccl-summarize-job-log
+
+Read one CCCL CI job log; return a tight summary.
+
+## When to invoke
+
+- **Cluster-representative summarization.** A triage skill picked one representative job per failure cluster,
+  fetched logs to `/tmp/claude/<sessionid>/job_<JID>.log`, and dispatches one summarize agent per log in parallel.
+  Each returns first-error, failing-step, infra-vs-code classification.
+- **One-off log digest.** A skill needs to know what's in a single job log (whose path it already has) without
+  reading the full text into orchestrator context.
+
+## Inputs
+
+- `log: <path>` — full path to a downloaded job log.
+- `context: <one-line hint>` (optional) — e.g. job name + toolchain.
+
+Missing `log:` → abort.
+
+## Steps
+
+1. **Find the first real error.** Grep for `error|FAIL|exit code|##\[error\]` (case-insensitive) and read context
+   around the hits. Ignore retries of the same error — pick the underlying cause.
+2. **Identify the failing step.** GHA logs prefix each step with a `##[group]` banner; the command appears just
+   below (often with `+` from `set -x`).
+3. **Capture the diagnostic.** File:line + 1–2 lines of context for compiler/linker/test failures; step name for
+   infra failures.
+4. **Classify.** `code` (real failure) / `infra` (network, artifact, container pull, runner crash, OOM, timeout) /
+   `flaky` (known-flaky test, rest of run succeeded) / `unknown`.
+5. **CCCL-specific flags.** Specific toolchain combo (useful for `cccl-ci-overrides`), cluster of related
+   failures, path naming a recently-introduced change.
+
+## Output
+
+```
+**Job:** <full name from `context:` or `<log-basename>`>
+**Class:** code | infra | flaky | unknown
+
+**First real error** (log line <N>):
+  <one or two lines>
+
+**Failing step:** <step name>
+
+**Diagnostic:**
+  <2-4 lines with file:line>
+
+**CCCL flags:**
+  - <observation>
+```
+
+≤10 lines of body text.
diff --git a/.agent/skills/cccl-agent-impl/SKILL.md b/.agent/skills/cccl-agent-impl/SKILL.md
new file mode 100644
index 00000000000..67d9aeb56c3
--- /dev/null
+++ b/.agent/skills/cccl-agent-impl/SKILL.md
@@ -0,0 +1,54 @@
+---
+name: cccl-agent-impl
+description: "How skills and agents work in the CCCL repository. Filesystem layout, invocation, frontmatter, allow-list semantics, intent-driven auto-discovery. Load this skill when you land in the CCCL repo cold and don't know what skills or agents are, when you see references to `.agent/skills` or `.agent/agents` and want to understand them, or when authoring a new CCCL skill or agent."
+---
+
+# cccl-agent-impl
+
+## Filesystem
+
+```
+<repo>/.agent/
+  skills/<name>/SKILL.md
+  agents/<name>.md
+
+<repo>/.claude/
+  skills  -> ../.agent/skills    (directory symlink)
+  agents  -> ../.agent/agents    (directory symlink)
+  settings.json
+```
+
+Canonical files live under `.agent/`. Claude Code reads `.claude/skills/` and `.claude/agents/`; Codex reads
+`.agent/`.
+
+## Skills
+
+`.agent/skills/<name>/SKILL.md`. Frontmatter:
+
+```yaml
+---
+name: <kebab-case>
+description: "<trigger surface — used for intent matching>"
+---
+```
+
+Invoke via the **Skill tool** with `skill: <name>`. Not reentrant.
+
+## Agents
+
+`.agent/agents/<name>.md`. Frontmatter:
+
+```yaml
+---
+name: <name>
+description: "<what and when>"
+model: haiku
+tools: Read, Grep, Bash
+---
+```
+
+CCCL agents are **non-interactive** — no `AskUserQuestion`. User dialogue belongs in the calling skill (often via
+`cccl-clarify`). Pick `model:` per workload: `haiku` for mechanical tasks (log parsing, jq munging, SHA
+verification); `sonnet` for multi-file reasoning or judgment (e.g. `cccl-ci-overrides`).
+
+Dispatch via the **Agent tool** with `subagent_type: <name>`. The agent runs to completion and returns one message.
diff --git a/.agent/skills/cccl-bisect/SKILL.md b/.agent/skills/cccl-bisect/SKILL.md
new file mode 100644
index 00000000000..4da48132993
--- /dev/null
+++ b/.agent/skills/cccl-bisect/SKILL.md
@@ -0,0 +1,73 @@
+---
+name: cccl-bisect
+description: "Run a git bisect on CCCL to identify which commit introduced a regression. Two routes: cloud (dispatch `.github/workflows/git-bisect.yml` via `gh workflow run`, runs in CCCL CI infrastructure on a GPU runner) or local (invoke `ci/util/git_bisect.sh` via `.devcontainer/launch.sh`). Walks the user through preset / build-targets / ctest-targets / lit-tests / good-ref / bad-ref selection. Use when the user has a regression and wants to find the introducing commit. Trigger phrases: \"bisect this regression\", \"find when X broke\", \"git bisect\"."
+---
+
+# cccl-bisect
+
+Bisects are slow. Restrict build/test targets to the smallest set that reliably reproduces the regression.
+
+## Sources of truth
+
+- `.github/workflows/git-bisect.yml` — cloud-dispatch workflow.
+- `ci/util/git_bisect.sh` — local script wrapped by the workflow.
+- `ci/util/build_and_test_targets.sh` — per-commit configure/build/test driver.
+- `docs/cccl/development/build_and_bisect_tools.rst` — full docs.
+
+## Inputs needed
+
+- **`preset`** — CMake preset (e.g. `cub-cpp20`, `thrust-cpp17`, `libcudacxx`, `cudax`). `cmake --list-presets`
+  enumerates them.
+- **`build_targets`** — space-separated ninja targets.
+- **`ctest_targets`** — space-separated CTest `-R` regexes. Optional.
+- **`lit_precompile_tests` / `lit_tests`** — space-separated libcudacxx lit paths relative to
+  `libcudacxx/test/libcudacxx/`. Optional.
+- **`good_ref`** / **`bad_ref`** — commit/tag/branch, or `-Nd` ("N days ago on main", e.g. `-7d`), or empty
+  (defaults: latest release tag / `main`).
+- **`cmake_options`** — extra `-D…=…` flags. Optional.
+- **`launch_args`** — extra `--cuda X` / `--host Y` for devcontainer. Optional.
+
+Route ambiguous inputs through `cccl-clarify`.
+
+## Route 1 — cloud dispatch
+
+```
+gh workflow run git-bisect.yml --repo NVIDIA/cccl --ref <branch> \
+  -f runner='<runner-label>' \
+  -f preset='<preset>' \
+  -f build_targets='<targets>' \
+  -f ctest_targets='<regex>' \
+  -f good_ref='<good>' \
+  -f bad_ref='<bad>'
+```
+
+Runner labels:
+
+- `linux-amd64-cpu16` — 16-core CPU box (build-only bisects).
+- `linux-amd64-gpu-rtxa6000-latest-1` — RTX A6000, 1 GPU (test bisects).
+- Others: see the workflow file inputs.
+
+Return the run URL.
+
+## Route 2 — local
+
+Requires Docker.
+
+```
+.devcontainer/launch.sh -d <launch_args> --gpus all \
+  -- ./ci/util/git_bisect.sh \
+    --summary-file /tmp/shared/summary.md \
+    --good-ref '<good>' \
+    --bad-ref '<bad>' \
+    --preset '<preset>' \
+    --build-targets '<targets>' \
+    --ctest-targets '<regex>'
+```
+
+Single long Bash invocation — no `&&` chains.
+
+## Output
+
+Both routes write a `summary.md` capturing the found-bad commit (hash, author, message), the build/test command
+that distinguishes good from bad, and the bisect log. Cloud route surfaces a "Bisection Results" URL in the GHA
+step summary.
diff --git a/.agent/skills/cccl-build-and-test-targets/SKILL.md b/.agent/skills/cccl-build-and-test-targets/SKILL.md
new file mode 100644
index 00000000000..13a32e58c90
--- /dev/null
+++ b/.agent/skills/cccl-build-and-test-targets/SKILL.md
@@ -0,0 +1,73 @@
+---
+name: cccl-build-and-test-targets
+description: "Reference for `ci/util/build_and_test_targets.sh` — CCCL's preset-driven configure/build/test driver used by CI, the bisect workflow, and ad-hoc local runs. Covers `--preset`, `--cmake-options`, `--configure-override`, `--build-targets`, `--ctest-targets`, `--lit-precompile-tests`, `--lit-tests`, `--custom-test-cmd`. Use when the user wants to build or test a specific target without running the full CI matrix. Trigger phrases: \"build just X\", \"run test Y\", \"targeted build\", \"how do I run the cub tests\"."
+---
+
+# cccl-build-and-test-targets
+
+`ci/util/build_and_test_targets.sh` configures, builds, and tests a CMake preset with the targets you specify.
+Run it from the repo root, inside the devcontainer (or anywhere the preset's compiler is available).
+
+## Flags
+
+| Flag                               | Effect                                                                                          |
+|------------------------------------|-------------------------------------------------------------------------------------------------|
+| `--preset <name>`                  | CMake preset (or use `--configure-override` instead)                                            |
+| `--cmake-options "<flags>"`        | Extra `-D…=…` flags appended to preset configure                                                |
+| `--configure-override "<cmd>"`     | Custom configure command (overrides `--preset` and `--cmake-options`)                           |
+| `--build-targets "<targets>"`      | Space-separated ninja targets. Omit to skip build (`"all"` for everything)                      |
+| `--ctest-targets "<regex>"`        | Space-separated CTest `-R` regexes. Omit to skip tests (`"."` for all)                          |
+| `--lit-precompile-tests "<paths>"` | libcudacxx lit paths to compile without execution (relative to `libcudacxx/test/libcudacxx/`)   |
+| `--lit-tests "<paths>"`            | libcudacxx lit paths to compile AND execute                                                     |
+| `--custom-test-cmd "<cmd>"`        | Arbitrary command after tests                                                                   |
+
+`--build-targets` and `--ctest-targets` are opt-in. Omit → nothing builds or tests; the script just configures.
+
+## Common patterns
+
+Most cases: pick the preset and pass the target as both `--build-targets` and `--ctest-targets`:
+
+```
+ci/util/build_and_test_targets.sh \
+  --preset <preset> \
+  --build-targets "<target>" \
+  --ctest-targets "<target>"
+```
+
+| Project    | Preset(s)                        | Target example                  |
+|------------|----------------------------------|---------------------------------|
+| CUB        | `cub-cpp17`, `cub-cpp20`         | `cub.cpp20.test.iterator`       |
+| Thrust     | `thrust-cpp17`, `thrust-cpp20`   | `thrust.cpp20.test.reduce`      |
+| cudax      | `cudax`                          | `cudax.cpp20.test.async_buffer` |
+| C Parallel | `cccl-c-parallel`                | `cccl.c.test.reduce`            |
+
+libcudacxx is lit-driven — use `--lit-precompile-tests` and `--lit-tests` instead of `--build-targets`:
+
+```
+ci/util/build_and_test_targets.sh \
+  --preset libcudacxx \
+  --lit-precompile-tests "std/algorithms/alg.nonmodifying/alg.any_of/any_of.pass.cpp" \
+  --lit-tests           "std/algorithms/alg.nonmodifying/alg.any_of/any_of.pass.cpp"
+```
+
+Avoid `--build-targets "libcudacxx.cpp20.precompile.lit"` — it precompiles the entire test suite.
+
+## Output
+
+Build dir at `build/${CCCL_BUILD_INFIX}/${PRESET}/` (parsed from the cmake configure log line
+`-- Build files have been written to:`). Phase-by-phase elapsed time printed with emoji status markers.
+
+## Wrapping in the devcontainer
+
+```
+.devcontainer/launch.sh -d --cuda 13.2 --host gcc14 -- \
+  ./ci/util/build_and_test_targets.sh \
+    --preset cub-cpp20 \
+    --build-targets "cub.cpp20.test.iterator"
+```
+
+## vs full-matrix scripts
+
+- `build_and_test_targets.sh` — single preset, named targets. Fast iteration.
+- `./ci/build_<project>.sh` / `./ci/test_<project>.sh` — full build/test cycles across host/std/arch matrix. Slow.
+  See `cccl-cpp-builds`.
diff --git a/.agent/skills/cccl-ci-benchmarks/SKILL.md b/.agent/skills/cccl-ci-benchmarks/SKILL.md
new file mode 100644
index 00000000000..956eccc8798
--- /dev/null
+++ b/.agent/skills/cccl-ci-benchmarks/SKILL.md
@@ -0,0 +1,55 @@
+---
+name: cccl-ci-benchmarks
+description: "Request CCCL benchmark runs in PR CI by editing `ci/bench.yaml`, or launch benchmark workflows directly via `gh workflow run`. Walks the user through filter selection (CUB ninja-target regex / Python path regex), GPU selection, and the `[bench-only]` commit-tag convention. Use when the user wants to benchmark a change on PR CI, or trigger a one-off benchmark workflow. Trigger phrases: \"benchmark this PR\", \"request a perf run\", \"compare benchmarks before/after\"."
+---
+
+# cccl-ci-benchmarks
+
+Two routes: PR-driven (edit `ci/bench.yaml`, push) and direct dispatch (`gh workflow run`).
+
+`ci/bench.yaml` holds the request; `ci/bench.template.yaml` is the empty template CI checks against. Both must
+match to merge.
+
+## Route 1 — PR-driven
+
+1. **Edit `ci/bench.yaml`:**
+   - Add CUB benchmark regexes under `benchmarks.filters.cub` (matched against ninja target names, e.g.
+     `^cub\.bench\.for_each\.base`).
+   - Add Python benchmark path regexes under `benchmarks.filters.python` (matched against paths under
+     `benchmarks/`, e.g. `compute/reduce/sum\.py`).
+   - Uncomment at least one GPU under `benchmarks.gpus`: `t4`, `rtx2080`, `rtxa6000`, `l4`, `rtx4090`, `h100`,
+     `rtxpro6000`. Pools are shared — pick conservatively.
+   - Optionally adjust `launch_args` (e.g. `"--cuda 13.2 --host gcc14"`).
+
+2. **Append `[bench-only]`** to the commit message — skips non-benchmark CI (equivalent to
+   `[skip-matrix][skip-vdc][skip-docs][skip-tpt]`).
+
+3. **Push.** Inspect dispatched jobs via `gh run view <RUN_ID>`.
+
+4. **Reset before final merge.** Restore `ci/bench.yaml` to match `ci/bench.template.yaml` (empty filters, no GPUs
+   uncommented).
+
+## Route 2 — direct dispatch
+
+If a benchmark workflow exists for direct dispatch (`gh workflow list --repo NVIDIA/cccl`):
+
+```
+gh workflow run <workflow-name>.yml --repo NVIDIA/cccl --ref <branch> -f <input>=<value>
+```
+
+Return the run URL. `gh workflow run` is mutating; prompts every use.
+
+## Defaults
+
+From `ci/bench.yaml`'s `Advanced` block:
+
+- `base_ref: "origin/main"` — what to compare against.
+- `test_ref: "HEAD"` — what to test.
+- `arch: "native"` — usually fine; can be a list like `"80;90"`.
+- `nvbench_args` — preset with timeout / skip-time / stopping criterion / throttle handling.
+
+## Pitfalls
+
+- Forgetting to uncomment a GPU → no jobs run.
+- Forgetting `[bench-only]` → wasteful full-CI run alongside.
+- Not resetting `ci/bench.yaml` before merge → merge blocked.
diff --git a/.agent/skills/cccl-ci/SKILL.md b/.agent/skills/cccl-ci/SKILL.md
new file mode 100644
index 00000000000..4fbfd5faa5e
--- /dev/null
+++ b/.agent/skills/cccl-ci/SKILL.md
@@ -0,0 +1,54 @@
+---
+name: cccl-ci
+description: "Orientation for CCCL's GitHub Actions CI. Pointers to the sources of truth (`ci/matrix.yaml`, `ci-overview.md`, workflow files) and a map of the moving parts. Use when the user asks how CI works here, where a CI behavior is defined, why a job ran or didn't, or what `[skip-*]` tags exist. Trigger phrases: \"how does CI work\", \"where is X CI defined\", \"why did this job run\", \"explain the matrix\". For TRIAGING a CI failure, use `cccl-triage-pr` or `cccl-triage-nightly` instead."
+---
+
+# cccl-ci
+
+## Sources of truth
+
+| Topic                                         | File                                                              |
+|-----------------------------------------------|-------------------------------------------------------------------|
+| Job matrix (PR / nightly / weekly + override) | `ci/matrix.yaml`                                                  |
+| Skip tags, override rules, troubleshooting    | `ci-overview.md`                                                  |
+| Workflow entry points                         | `.github/workflows/ci-workflow-{pull-request,nightly,weekly}.yml` |
+| `/ok to test` policy + trustees               | `.github/copy-pr-bot.yaml`, `CONTRIBUTING.md` § CI                |
+| Per-job runner setup                          | `.github/actions/workflow-run-job-{linux,windows}/`               |
+| Matrix expansion → dispatchable jobs          | `.github/actions/workflow-build/` running `build-workflow.py`     |
+| Job pruning by changed paths                  | `ci/inspect_changes.py`                                           |
+| Result aggregation                            | `.github/actions/workflow-results/`                               |
+| Bench-request config                          | `ci/bench.yaml`                                                   |
+| Git-bisect cloud dispatch                     | `.github/workflows/git-bisect.yml`                                |
+
+## PR run flow
+
+`ci-workflow-pull-request.yml` → `build-workflow.py` reads `ci/matrix.yaml`. Non-empty `workflows.override` wins;
+otherwise `inspect_changes.py` prunes by dirty projects from changed paths. Jobs run through
+`workflow-run-job-{linux,windows}/` in a devcontainer. `workflow-results/` aggregates; marks failed if any job
+failed OR if override is non-empty.
+
+## Scoping a PR's CI (both block merging)
+
+- **`[skip-*]` tags** on the last commit. Tokens in `ci-overview.md`.
+- **`workflows.override` in `ci/matrix.yaml`** — replaces the `pull_request` matrix with a targeted subset:
+
+  ```yaml
+  workflows:
+    override:
+      - {jobs: ['build'], project: 'cudax', ctk: '12.0', std: 'all', cxx: ['msvc14.39', 'gcc10', 'clang14']}
+  ```
+
+`cccl-ci-overrides` generates both from failed-job names and/or changed-path lists.
+
+## `/ok to test` policy
+
+Draft PRs need `/ok to test <SHA>` from a maintainer to start CI. Route all such requests through the
+`cccl-ok-to-test` agent (SHA-gated).
+
+## Gotchas
+
+- Non-empty `workflows.override` blocks merge. Reset to empty before final merge (don't remove the key).
+- Any `[skip-*]` tag blocks merge.
+- `ci/bench.yaml` must match `ci/bench.template.yaml` to merge.
+- `gh pr view --json statusCheckRollup` returns 100k+ tokens for 500-job PRs. Use `gh pr checks`.
+- `gh run view --log-failed` errors while a run is in progress. Use `gh api repos/NVIDIA/cccl/actions/jobs/<JID>/logs`.
diff --git a/.agent/skills/cccl-clarify/SKILL.md b/.agent/skills/cccl-clarify/SKILL.md
new file mode 100644
index 00000000000..8a6b25ee1c7
--- /dev/null
+++ b/.agent/skills/cccl-clarify/SKILL.md
@@ -0,0 +1,43 @@
+---
+name: cccl-clarify
+description: "Decision-point escalation. Use when you cannot resolve a question through default reasoning — tricky tradeoffs, scarce evidence, ambiguous user intent, or a fork in the road that needs human judgment. Triggered by phrases like \"I'm stuck\", \"not sure how to proceed\", \"should I X or Y\", \"help me decide\". Also invoked by other cccl-* skills when they need to surface a question to the user. Walks the three-step escalation (default reasoning → self-research → ask the user) and the \"how to ask well\" rules — print context in chat, AskUserQuestion with breakdown branch, point-by-point dialogue."
+---
+
+# cccl-clarify
+
+## Escalation ladder
+
+Stop at the first level that produces a confident answer.
+
+1. **Default reasoning** — resolve from existing context: prompt, conversation, files read, `AGENTS.md`, `cccl`
+   skill, memory. Escalate if the tradeoffs are balanced, evidence is thin, the decision is hard to reverse, or
+   intent is genuinely ambiguous.
+2. **Self-research** — cheapest source first: code, memory, in-repo docs (`AGENTS.md`, `CONTRIBUTING.md`,
+   `ci-overview.md`), upstream library docs, web, Explore subagent. Time-box. Two or three rounds without
+   confidence moving = escalate.
+3. **Ask the user** — when research won't close the gap.
+
+## How to ask well
+
+1. **Print context in chat.** Tool output isn't visible to the user. Frame the decision, what was tried, the
+   tradeoff axis — in your text, not just in the question prompt.
+2. **`AskUserQuestion` correctly.** 2–4 mutually-exclusive options (or `multiSelect`). Lead with the recommendation
+   and suffix `(Recommended)` when evidence favours it. Each option's `description` carries the substance. Don't
+   add "Other" — UI handles it.
+3. **Offer a breakdown branch** for non-trivial questions — a "walk me through it" option that lets the user defer
+   the pick.
+4. **Breakdown flow.** Offer further research (multi-select with "None — overview"). Then a 200–400 word overview:
+   problem, ordered decision points, tradeoffs, what's already decided. Walk point-by-point — dependent questions
+   sequential, not parallel. Confirm the chosen path end-to-end before acting.
+
+## When NOT to invoke
+
+- Single-line obvious fixes.
+- Conversational questions — answer them.
+- Decisions whose default is so obvious that asking is noise.
+- Questions answered in `AGENTS.md`, the `cccl` skill, or memory.
+
+## Hard prohibitions
+
+- Never invoke recursively.
+- Never use to defer a decision the user already made.
diff --git a/.agent/skills/cccl-commit/SKILL.md b/.agent/skills/cccl-commit/SKILL.md
new file mode 100644
index 00000000000..fa39252d9ce
--- /dev/null
+++ b/.agent/skills/cccl-commit/SKILL.md
@@ -0,0 +1,120 @@
+---
+name: cccl-commit
+description: "Walk uncommitted changes in a CCCL worktree through an interactive review-and-stage flow: survey the diff, optionally split into multiple commit groups, walk chunks one at a time with diff rendering and an action menu (stage / edit / defer / revert), optionally run a test gate, draft commit message(s), confirm, and commit. Use when committing uncommitted changes, preparing a branch for push, or wrapping up a fix. Trigger phrases: \"commit these changes\", \"wrap this up\", \"ready to commit\", \"stage and commit\", \"prepare commits\", \"split into commits\". For PR creation or `/ok to test`, route to `cccl-pr` after committing."
+---
+
+# cccl-commit
+
+Interactive commit prep. Route every user-facing question through `cccl-clarify`. Refuses on `main`. Scratch dir:
+`mkdir -p /tmp/claude/<sessionid>`.
+
+## Step 1 — Component selection
+
+`AskUserQuestion`, `multiSelect: true`:
+
+- **Split** — group hunks into multiple commits.
+- **Interactive** — walk each chunk with a diff render + action menu.
+- **Test gate** — run `pre-commit` and a build/test target before committing.
+- **Commit** — write messages and execute. Without this, nothing commits.
+
+Commit-only with no Split / no Interactive → fast path: commit whatever is staged (Step 5).
+
+## Step 2 — Survey
+
+One Bash call each:
+
+- `git status -sb`
+- `git diff > /tmp/claude/<sessionid>/diff-unstaged.txt` (if > 2k lines)
+- `git diff --cached > /tmp/claude/<sessionid>/diff-staged.txt` (same threshold)
+- `git log --oneline -10`
+
+## Step 3 — Plan (if Split or Interactive)
+
+`git diff > /tmp/claude/<sessionid>/patch.txt` (or `git diff HEAD` for combined).
+
+Plan into commit groups CC-NN (one group if Split not selected). Within each group, slice into chunks; write each
+slice to `/tmp/claude/<sessionid>/chunks/CC-NN.patch`. Coverage check: sum-of-slice-hunks == total-hunks. Run
+`git apply --cached --check chunks/CC-NN.patch` on every slice (the check must target the index, not the worktree).
+
+Present plan summary (groups, chunks/group, total lines). `cccl-clarify` → approve / reorder / discuss.
+
+## Step 4 — Walk chunks (if Interactive)
+
+For each chunk in planned order:
+
+1. Read `chunks/CC-NN.patch`.
+2. Render the diff verbatim in chat as a ` ```diff ` fenced block, per-hunk headers naming file:line range.
+   Never use Bash output for diffs. Pattern dedup is fine for repetition — show pattern once, list other
+   occurrences and locations.
+3. Suggest improvements (numbered, with file:line refs) or note "No suggested changes".
+4. `AskUserQuestion`:
+   - **Stage as-is** — `git apply --cached chunks/CC-NN.patch`. Verify with `git diff --cached --stat`; STOP if
+     the staged file list doesn't match the expected set.
+   - **Apply suggested edits, re-review** — `Edit`, regenerate diff with `git diff -- <files>`, loop.
+   - **Apply custom edits, re-review** — user describes, `Edit`, loop.
+   - **Leave unstaged** — defer.
+   - **Revert** — `git apply -R chunks/CC-NN.patch` (or `git checkout -- <file>` for whole-file).
+   - **Discuss** — open conversation; loop.
+
+Track: current group, staged/deferred/reverted chunks.
+
+Split selected, Interactive not → auto-stage each slice in order. Verify the staged set grows monotonically into
+the per-group expected set. STOP on divergence.
+
+## Step 5 — Test gate (if selected) + commit
+
+### 5.0 Fast path
+
+Commit-only with no Split / no Interactive: confirm staged set via `git diff --cached --stat` (empty → exit),
+skip the test gate unless asked, go to 5.2.
+
+### 5.1 Tests
+
+`cccl-clarify` → skip / `pre-commit run --files <staged>` / dispatch `cccl-build-and-test-targets`. On failure:
+investigate / commit anyway / abort.
+
+### 5.2 Commit message
+
+`cccl-clarify` for detail tier — **Trivial** (subject only) / **Standard** (subject + 1–6 body lines) /
+**Detailed** (subject + multi-paragraph).
+
+Rules:
+- Subject ≤ 72 chars, imperative, no trailing period.
+- Match CCCL's prefix convention from `git log --oneline -20`.
+- Body wraps ~72 chars.
+- No co-author / tool-attribution footers.
+- `[skip-*]` tags apply to a single push and must appear on the LAST commit's last line only.
+
+Draft. `cccl-clarify` → use / revise / cancel.
+
+### 5.3 Commit
+
+Write final message to `/tmp/claude/<sessionid>/commit-msg-CC.txt`. Then `git commit -F <path>` (mutating; expect
+prompt). Verify with `git show -p HEAD`: SHA, subject, file list match expectations.
+
+## Step 6 — Inter-group transition (if Split)
+
+After each commit, `cccl-clarify` → continue / pause / end. On continue, verify remaining slices still apply
+(`git apply --cached --check` per remaining slice); regenerate the patch and re-plan if any fail.
+
+Remind caller to use `cccl-ci-overrides` to set up a minimal CI run if needed.
+
+Last group → final summary (all SHAs, deferred, reverted) and exit.
+
+## Hard prohibitions
+
+Unless explicitly approved by the user in `cccl-clarify` at the moment of action, never do any of the following:
+
+- Never edit on `main`.
+- Never `--no-verify`.
+- Never `--amend` a published commit.
+- Never co-author / tool-attribution footers.
+
+In any circumstance:
+
+- Never fabricate diff content — every line shown comes from the patch or `git diff`.
+- Never `git add` without explicit per-chunk user approval.
+
+## Handoff
+
+After commits land: route to `cccl-pr` for push / open / update / `/ok to test`.
diff --git a/.agent/skills/cccl-cpp-builds/SKILL.md b/.agent/skills/cccl-cpp-builds/SKILL.md
new file mode 100644
index 00000000000..348244c5272
--- /dev/null
+++ b/.agent/skills/cccl-cpp-builds/SKILL.md
@@ -0,0 +1,53 @@
+---
+name: cccl-cpp-builds
+description: "Build and test CCCL's C++ libraries (libcudacxx, CUB, Thrust, cudax, C Parallel) — per-project `ci/build_*.sh` and `ci/test_*.sh` full-matrix scripts, architecture conventions, and pointers to the targeted-build alternative. Use when the user wants to build or test a CCCL C++ library across a full host/std/arch matrix, or asks about architecture flag syntax. Trigger phrases: \"build cub\", \"test libcudacxx\", \"build thrust\", \"full matrix build\", \"compile cudax\", \"cuda architectures\". For SINGLE-target fast iteration use `cccl-build-and-test-targets` instead."
+---
+
+# cccl-cpp-builds
+
+Per-project full-matrix build + test scripts under `ci/`. Flags: host compiler, C++ standard, GPU architectures.
+
+Full builds: 60+ min build, 30+ min test — never cancel. For single targets, use `cccl-build-and-test-targets`.
+
+## Scripts
+
+```
+./ci/build_<project>.sh  [-cxx <compiler>] [-std <std>] [-arch "<arch-list>"]   # no GPU
+./ci/test_<project>.sh    -cxx <compiler>   -std <std>   -arch "<arch-list>"    # GPU required
+```
+
+| Project    | Build / test scripts                  | Stds    |
+|------------|---------------------------------------|---------|
+| CUB        | `build_cub`, `test_cub`               | 17, 20  |
+| Thrust     | `build_thrust`, `test_thrust`         | 17, 20  |
+| libcudacxx | `build_libcudacxx`, `test_libcudacxx` | 17, 20  |
+| cudax      | `build_cudax`, `test_cudax`           | 20 only |
+| C Parallel | `build_cccl_c_parallel`               | 17 only |
+
+Test scripts build implicitly if the tree is missing. CTest preset form (e.g. `ctest --preset=cub-cpp17`) also
+works.
+
+Compute-sanitizer variants: append `-compute-sanitizer-{memcheck,racecheck,initcheck,synccheck}`. Not all
+projects support all tools — check `--help`.
+
+## Flags
+
+- **`-cxx`** — host compiler (`g++`, `clang++`, `msvc14.39`).
+- **`-std`** — C++ standard (`17` or `20`, subject to project limits above).
+- **`-arch`** — semicolon-separated CUDA architecture list (CMake `CUDA_ARCHITECTURES`):
+
+  | Form             | Generates             |
+  |------------------|-----------------------|
+  | `<XX>`           | PTX + SASS for SM XX  |
+  | `<XX-real>`      | SASS only             |
+  | `<XX-virtual>`   | PTX only              |
+  | `native`         | Detect host GPU       |
+  | `all-major-cccl` | Default for PR builds |
+
+  Examples: `"native"`, `"80"`, `"70;75;80-virtual"`.
+
+## Performance
+
+- `sccache` is enabled in the devcontainer (CCCL-team bucket auth).
+- Limit `-arch` — `"native"` or `"80"` is much faster than `"all-major-cccl"`.
+- Build scripts already parallelize via ninja.
diff --git a/.agent/skills/cccl-devcontainers/SKILL.md b/.agent/skills/cccl-devcontainers/SKILL.md
new file mode 100644
index 00000000000..4920e18417c
--- /dev/null
+++ b/.agent/skills/cccl-devcontainers/SKILL.md
@@ -0,0 +1,58 @@
+---
+name: cccl-devcontainers
+description: "Use CCCL's `.devcontainer/launch.sh` to run one-off bash sessions, builds, or tests inside a CCCL-configured container with a chosen CUDA toolkit and host compiler. Covers the `-d` / `--cuda` / `--host` / `--gpus` / `--env` / `--volume` argument conventions and the `CCCL_BUILD_INFIX` already-in-container check. Use when the user wants to build/test in a clean, reproducible environment, run a quick experiment with a specific toolchain, or escape from host environment problems. Trigger phrases: \"run in devcontainer\", \"launch the container\", \"build with cuda 13.2\", \"open a shell with gcc 14\"."
+---
+
+# cccl-devcontainers
+
+`.devcontainer/launch.sh` boots a Docker container preconfigured with a chosen CUDA toolkit and host compiler,
+mounts the repo, and either drops into a shell or runs a script. **Linux-only** — Linux host, Linux container.
+Windows / MSVC builds run outside the devcontainer.
+
+## Flags
+
+| Flag                     | Purpose                                  |
+|--------------------------|------------------------------------------|
+| `-d`, `--docker`         | Run without VSCode (required for agents) |
+| `--cuda <version>`       | CUDA toolkit (e.g. `13.2`, `12.9`)       |
+| `--cuda-ext`             | Image with extended CTK libraries        |
+| `--host <compiler>`      | Host compiler (`gcc14`, `clang17`)       |
+| `--gpus <request>`       | GPU passthrough (`all` for everything)   |
+| `-e`, `--env KEY=VAL`    | Inject env var                           |
+| `-v`, `--volume SRC:DST` | Mount additional path                    |
+| `-- <script> [args]`     | Run script inside container after setup  |
+
+Examples:
+
+```
+.devcontainer/launch.sh -d --cuda 13.2 --host gcc14
+.devcontainer/launch.sh -d --cuda 12.9 --host gcc13 -- ./ci/build_cub.sh -cxx g++ -std 17 -arch native
+.devcontainer/launch.sh -d --gpus all -- ./ci/util/build_and_test_targets.sh --preset cub-cpp20 --build-targets "cub.cpp20.test.iterator"
+```
+
+## Already inside a container?
+
+`CCCL_BUILD_INFIX` is set inside the container. Before launching:
+
+```
+echo "$CCCL_BUILD_INFIX"
+```
+
+Non-empty → already inside; run the command directly. Nested launches don't work.
+
+First launch pulls the image; subsequent launches use cache.
+
+## Updating devcontainers
+
+Per-combination subdirs (`.devcontainer/cuda<version>-<host>/`) and their `devcontainer.json` files are
+**generated** — direct edits get overwritten. To change the set of available containers:
+
+1. Edit `ci/matrix.yaml` — the `dc` (and `dc_ext` for extended-CTK) entries control which CUDA × host-compiler
+   combinations exist.
+2. If the template itself needs changing, edit the base `.devcontainer/devcontainer.json`.
+3. Run `.devcontainer/make_devcontainers.sh --clean` from the repo root to regenerate per-combination subdirs and
+   prune stale ones.
+4. Push; CI's "Validate Devcontainer" jobs run.
+
+`[skip-vdc]` skips Validate Devcontainer jobs. Don't use it on PRs that modify `.devcontainer/`, `ci/`, or
+`.github/`.
diff --git a/.agent/skills/libcudacxx-style/SKILL.md b/.agent/skills/cccl-libcudacxx-style/SKILL.md
similarity index 99%
rename from .agent/skills/libcudacxx-style/SKILL.md
rename to .agent/skills/cccl-libcudacxx-style/SKILL.md
index b30e06856ec..48eec852094 100644
--- a/.agent/skills/libcudacxx-style/SKILL.md
+++ b/.agent/skills/cccl-libcudacxx-style/SKILL.md
@@ -1,5 +1,5 @@
 ---
-name: libcudacxx-style
+name: cccl-libcudacxx-style
 description: Make the code in libcudacxx/include, cudax/include compliant with the coding style
 ---
 
diff --git a/.agent/skills/cccl-pr/SKILL.md b/.agent/skills/cccl-pr/SKILL.md
new file mode 100644
index 00000000000..cf1b6adca3b
--- /dev/null
+++ b/.agent/skills/cccl-pr/SKILL.md
@@ -0,0 +1,100 @@
+---
+name: cccl-pr
+description: "Manage CCCL pull requests — open a new draft PR after commits land, edit/comment on an existing PR (title, body, draft↔ready, comments), or push + post `/ok to test` to trigger CI. Detects fork-vs-upstream remote, opens drafts via `gh pr create --draft --repo NVIDIA/cccl`, dispatches `cccl-ok-to-test` for SHA-verified CI triggers. Use when pushing a branch and opening a PR, editing an existing PR's title/body, toggling draft/ready, commenting, or triggering CI. Trigger phrases: \"open a PR\", \"push and PR\", \"update PR description\", \"mark PR ready\", \"comment on the PR\", \"trigger CI on PR\". For commits, route to `cccl-commit` first."
+---
+
+# cccl-pr
+
+CCCL PR lifecycle. Route every user-facing question through `cccl-clarify`. Refuses on `main`. Never force-pushes;
+never deletes branches; never closes/merges PRs.
+
+## Step 1 — Resolve mode
+
+`cccl-clarify` (or infer from phrasing):
+
+- **Open new draft PR** → Phase 1.
+- **Edit existing PR** (title / body / draft↔ready / base) → Phase 2.
+- **Comment** → Phase 3.
+- **Push + `/ok to test`** → Phase 4.
+
+## Phase 1 — Open a new draft PR
+
+### 1.1 Sanity checks
+
+- Refuse on `main` (`git rev-parse --git-dir` vs `--git-common-dir`).
+- Refuse if `git status --porcelain` is dirty (route to `cccl-commit`).
+- Confirm commits ahead: `git log --oneline origin/main..HEAD`.
+
+### 1.2 Detect push remote
+
+```
+gh auth status
+git remote -v
+gh pr view --json headRepositoryOwner   # if branch already has an upstream PR
+```
+
+Fork remote present → push there. Only `origin` and it points at `NVIDIA/cccl` → user is a maintainer; confirm
+before pushing. Ambiguous → `cccl-clarify`.
+
+### 1.3 Push
+
+`git push -u <remote> <branch>` (mutating; expect prompt). Capture any "view PR" URL hint from the output.
+
+### 1.4 Draft title + body, open PR
+
+Seed from `git log --oneline origin/main..HEAD`. Title ≤ 72 chars, imperative. Body: bulleted commit summary, refs to
+issues/PRs, test plan when non-trivial. `cccl-clarify` → confirm / revise / cancel.
+
+Print the generated PR description to chat and ask the user to confirm or edit. On confirm, write to `/tmp/claude/<sessionid>/pr-body.md` and run:
+
+```
+gh pr create --draft --repo NVIDIA/cccl --base main \
+  --head <fork-owner>:<branch> \
+  --title "<title>" \
+  --body-file /tmp/claude/<sessionid>/pr-body.md
+```
+
+Capture the new PR number from the returned URL.
+
+### 1.5 Trigger CI
+
+`cccl-clarify` → dispatch `cccl-ok-to-test` now (recommended; drafts need `/ok to test <SHA>` to start CI). Then
+suggest `ScheduleWakeup(delaySeconds=1200)` polling on `gh pr checks <PR#>`.
+
+## Phase 2 — Edit an existing PR
+
+Resolve PR# from current branch (`gh pr view --json number`) or user input. `cccl-clarify`:
+
+- **Edit title** — draft, confirm, `gh pr edit <PR#> --title "<new>"`.
+- **Edit body** — read current via `gh pr view <PR#> --json body`, draft, confirm,
+  `gh pr edit <PR#> --body-file /tmp/claude/<sessionid>/pr-body.md`.
+- **Mark ready** — `gh pr ready <PR#>`.
+- **Mark draft** — `gh pr ready <PR#> --undo`.
+- **Change base** — `gh pr edit <PR#> --base <new-base>`. Rare.
+
+All mutating; one approval per use, never bundled.
+
+## Phase 3 — Comment
+
+Resolve PR#. Draft body, confirm via `cccl-clarify`, then:
+
+```
+gh pr comment <PR#> --repo NVIDIA/cccl --body "<comment>"
+```
+
+For `/ok to test <SHA>` specifically, use Phase 4 — the `cccl-ok-to-test` agent owns the SHA gate.
+
+## Phase 4 — Push + `/ok to test`
+
+For an existing PR whose branch has new local commits.
+
+1. `git push <remote> <branch>` (never force unless *explicitly* told by the user).
+2. Dispatch the `cccl-ok-to-test` agent. It owns the SHA verification, the comment, and the polling reminder.
+
+## Hard prohibitions
+
+- Never force-push (no `--force`, no `+<ref>`).
+- Never `gh pr close` / `gh pr merge` — out of scope.
+- Never bypass the `cccl-ok-to-test` SHA gate by posting `/ok to test` directly.
+- Never edit on `main`.
+- Never bundle multiple mutating ops into one approval.
diff --git a/.agent/skills/cccl-python/SKILL.md b/.agent/skills/cccl-python/SKILL.md
new file mode 100644
index 00000000000..56f65172d5c
--- /dev/null
+++ b/.agent/skills/cccl-python/SKILL.md
@@ -0,0 +1,46 @@
+---
+name: cccl-python
+description: "CCCL's Python packages (`cuda-cccl`): installation, module layout, build/test scripts, test organization. Use when the user works on the Python bindings, builds/tests Python components, or asks about the `cuda.compute` / `cuda.coop` / `cuda.cccl.headers` modules. Trigger phrases: \"cccl python\", \"cuda.compute\", \"cuda.coop\", \"cuda-cccl package\", \"build the python bindings\", \"test python\"."
+---
+
+# cccl-python
+
+Python components live under `python/cuda_cccl/`. Build/test scripts take `-py-version` instead of compiler flags.
+Supported: Python 3.10 – 3.13.
+
+## Modules
+
+- `cuda.compute` — device-level algorithms, iterators, custom GPU types.
+- `cuda.coop._experimental` — block/warp primitives for Numba CUDA.
+- `cuda.cccl.headers` — programmatic access to CCCL headers.
+
+## Install from source
+
+```
+pip install -e "python/cuda_cccl[test-cu13]"   # or [test-cu12] for CTK 12.X; quoted so shells like zsh don't glob
+```
+
+Requires CTK 12.x or 13.x, NVIDIA GPU CC 6.0+. Base deps: `numba>=0.60.0`, `numpy`, `cuda-pathfinder>=1.2.3`,
+`cuda-core`, `typing_extensions`. CUDA extras add `cuda-bindings`, `cuda-toolkit`, `numba-cuda`.
+
+## Build / test
+
+```
+./ci/build_cuda_cccl_python.sh        -py-version 3.10
+./ci/test_cuda_compute_python.sh      -py-version 3.10
+./ci/test_cuda_coop_python.sh         -py-version 3.10
+./ci/test_cuda_cccl_headers_python.sh -py-version 3.10
+./ci/test_cuda_cccl_examples_python.sh -py-version 3.10
+```
+
+Build script needs no GPU; test scripts do.
+
+## Layout
+
+```
+python/cuda_cccl/
+├── cuda/{compute,coop,cccl/{parallel,cooperative,headers}}/
+├── tests/{compute,coop,headers}/  + test_examples.py
+├── benchmarks/
+└── pyproject.toml
+```
diff --git a/.agent/skills/cccl-resplit-branch/SKILL.md b/.agent/skills/cccl-resplit-branch/SKILL.md
new file mode 100644
index 00000000000..cbe2d216e30
--- /dev/null
+++ b/.agent/skills/cccl-resplit-branch/SKILL.md
@@ -0,0 +1,111 @@
+---
+name: cccl-resplit-branch
+description: "Rebase a CCCL feature branch onto `main` and resplit its commit history into a clean series, using the same interactive chunk-walkthrough as `cccl-commit`. Backs up the original branch tip, rebases (resolving conflicts), collapses commits to a single working-tree diff via `git reset --mixed`, then hands off to `cccl-commit`'s split / interactive / commit pipeline. Use when a branch has accumulated messy / squashable / out-of-order commits and needs a clean series before opening or refreshing a PR. Trigger phrases: \"resplit this branch\", \"clean up these commits\", \"rebase and resplit\", \"reorganize the commits\", \"squash and resplit\", \"fix up commit history\". For first-time commits on a fresh branch, use `cccl-commit`."
+---
+
+# cccl-resplit-branch
+
+Rebase onto `main`, then collapse the branch's commits into a working-tree diff and replay them as a clean series
+via `cccl-commit`'s flow. Route every user-facing question through `cccl-clarify`. Refuses on `main`. Never
+force-pushes — that's `cccl-pr` Phase 4 with explicit user approval.
+
+## Step 1 — Pre-flight
+
+- Refuse on `main` (`git rev-parse --git-dir` vs `--git-common-dir`).
+- Working tree must be clean: `git status --porcelain` empty. Dirty → route to `cccl-commit` first.
+- Scratch: `mkdir -p /tmp/claude/<sessionid>`.
+- `git log --oneline main..HEAD > /tmp/claude/<sessionid>/original-commits.txt`. Empty → nothing to resplit;
+  exit. Branch is already pushed with review activity → `cccl-clarify` confirms the user wants to rewrite
+  published history (force-push will come later via `cccl-pr` Phase 4).
+
+## Step 2 — Backup the tip
+
+`cccl-clarify` confirms the backup ref name (default `refs/backup/<branch>-<YYYYMMDD-HHMMSS>`). Then:
+
+```
+git update-ref refs/backup/<branch>-<timestamp> HEAD
+```
+
+Surface the backup ref in every later confirmation prompt — recovery is `git reset --hard <ref>`.
+
+## Step 3 — Rebase onto `main`
+
+```
+git fetch origin main
+git rebase origin/main
+```
+
+On conflict, for each conflicted file route through `cccl-clarify`:
+
+- **Resolve manually** — read file, present conflict markers verbatim in chat, suggest resolution, user picks.
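+- **Take ours** — `git checkout --ours <file>` (during a rebase, "ours" is the `origin/main` side).
+- **Take theirs** — `git checkout --theirs <file>` (during a rebase, "theirs" is the branch commit being replayed).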
+- **Skip commit** — `git rebase --skip` (loses content; only for already-redone work).
+- **Abort** — `git rebase --abort`; surface backup ref; exit.
+
+After resolution: `git add <file>` per-file (never bulk-stage), then `git rebase --continue`.
+
+### 3.1 Verify
+
+```
+git diff origin/main..HEAD --stat > /tmp/claude/<sessionid>/rebased-diff-stat.txt
+```
+
+Compare touched-file set to the pre-rebase commit list. Material mismatch → `cccl-clarify` (continue / inspect /
+abort to backup).
+
+## Step 4 — Collapse to working tree
+
+```
+git reset --mixed origin/main
+```
+
+`--mixed` keeps every change in the working tree, unstaged — the starting state `cccl-commit` expects. **Never
+`--hard`** (would discard the work). Mutating; expect prompt; surface the backup ref in the prompt.
+
+Verify: `git diff --stat` must match the rebased diff stat from Step 3.1. Material divergence → STOP.
+
+## Step 5 — Hand off to `cccl-commit`
+
+Run `cccl-commit` from Step 1 onward. Splitting and Committing are implicit (a resplit means at least one new
+commit), but offer Interactive (strongly recommended — catches drift the original series hid) and Test gate via
+`cccl-clarify`.
+
+Seed the chunk planner from the original commit series (read `original-commits.txt`) — the resplit's job is to
+*fix* problems, not invent unrelated structure. Use original commit subjects as starting drafts for the new
+messages.
+
+## Step 6 — Final tree check
+
+After the last commit:
+
+```
+git diff HEAD refs/backup/<branch>-<timestamp> --stat
+```
+
+Non-empty → the new branch diverges from the original. Present the delta via `cccl-clarify`:
+
+- **Expected** (upstream changes from the rebase, or chunks reverted / edited during walkthrough) — accept.
+- **Unexpected** — investigate, or `git reset --hard <backup>` to abort.
+
+Report final tip SHA, commit list, backup ref location, and a force-push reminder if the branch was published.
+
+## Recovery
+
+At any time before commits start landing: `git reset --hard refs/backup/<branch>-<timestamp>` restores the
+original tip. After commits land: same command, but the new series is lost; surface this trade-off when the user
+asks to abort late.
+
+## Hard prohibitions
+
+- Never `git reset --hard` outside an explicit user-confirmed abort.
+- Never force-push — `cccl-pr` Phase 4 owns that with its own approval.
+- Never delete a backup ref without per-ref user approval.
+- Never `--no-verify`.
+- Never co-author / tool-attribution footers.
+- Never `git rebase --abort` autonomously — only on explicit user choice.
+
+## Handoff to `cccl-pr`
+
+If the branch was published, the resplit requires a force-push. Route to `cccl-pr` Phase 4 — and note its
+current force-push prohibition. Until that's opted-in, the user runs `git push --force-with-lease` by hand.
diff --git a/.agent/skills/cccl-sass-diff/SKILL.md b/.agent/skills/cccl-sass-diff/SKILL.md
new file mode 100644
index 00000000000..1f4d8944ec6
--- /dev/null
+++ b/.agent/skills/cccl-sass-diff/SKILL.md
@@ -0,0 +1,32 @@
+---
+name: cccl-sass-diff
+description: "Compare CUDA SASS or PTX between two CCCL builds (commits, branches, working-copy vs HEAD) to detect non-trivial codegen changes while filtering noise from addresses, symbols, metadata, and pure register renaming. Use when the user asks to check for SASS changes, audit ABI/codegen impact of a change, or compare PTX. Trigger phrases: \"check for SASS changes\", \"compare SASS\", \"any codegen impact\", \"PTX diff\"."
+---
+
+# cccl-sass-diff
+
+Detect meaningful changes in generated CUDA machine code between two versions. Filter trivial noise so only
+behavior- or performance-affecting changes surface.
+
+## Inputs
+
+Ask via `cccl-clarify` if unclear:
+
+- Target/library being built.
+- SM architectures (detect, offer, confirm).
+- Baseline + candidate refs.
+- SASS (`cuobjdump -sass`) or PTX (`cuobjdump -ptx`).
+
+## Workflow
+
+1. Build both versions with the same arches and flags.
+2. Dump disassembly to `/tmp/claude/<sessionid>/{baseline,candidate}.sass`.
+3. Normalize both identically: strip addresses, build IDs, paths, timestamps, whitespace; drop empty/comment lines.
+4. `diff -u` the normalized listings.
+5. Classify — ignore register renames with identical opcodes/operands, label renumbering, formatting-only
+   differences.
+6. Report top 5 non-trivial regions: kernel name, change type (opcode, memory-access size, register count delta,
+   control-flow), normalized line numbers, plain-language interpretation. Or: "only noise detected".
+
+Save raw + normalized dumps and the diff command under `/tmp/claude/<sessionid>/`. Unsure on impact → surface and
+ask.
diff --git a/.agent/skills/cccl-triage-nightly/SKILL.md b/.agent/skills/cccl-triage-nightly/SKILL.md
new file mode 100644
index 00000000000..f1e2e474cba
--- /dev/null
+++ b/.agent/skills/cccl-triage-nightly/SKILL.md
@@ -0,0 +1,42 @@
+---
+name: cccl-triage-nightly
+description: "Diagnose failures in the latest scheduled CCCL nightly run on `main` in the CCCL repository. Locates the run, groups failures by toolchain/project, fetches representative logs, summarizes, presents to user, and — on approval — applies fixes against a new branch, opens a draft PR, posts `/ok to test <SHA>`. Use when the user asks to triage, diagnose, or fix nightly CI. Trigger phrases: \"triage the nightly\", \"what failed in nightly\", \"diagnose latest nightly\", \"fix nightly CI\", \"investigate nightly run\"."
+argument-hint: "[run-id-or-empty]"
+---
+
+# cccl-triage-nightly
+
+Same shape as `cccl-triage-pr`, but starts from a workflow run and ends by opening a fresh draft PR.
+
+Scratch dir, single-Bash discipline, worktree safety, and `cccl-clarify` routing match `cccl-triage-pr`.
+
+## Step 1 — Locate the run
+
+User-supplied run ID wins. Otherwise:
+
+```
+gh run list --workflow=ci-workflow-nightly.yml --branch=main --limit=1 --json databaseId,conclusion,createdAt,headSha > /tmp/claude/<sessionid>/nightly_run.json
+```
+
+Capture `databaseId` and `headSha`. `conclusion: success` → stop.
+
+## Step 2 — Fetch failures
+
+Dispatch `cccl-fetch-ci-failures` with the run ID.
+
+## Steps 3–7 — Group, fetch logs, summarize, present, diagnose
+
+Identical to `cccl-triage-pr` steps 3–7.
+
+## Step 8 — Ship the fix
+
+No existing branch or PR — open a fresh one.
+
+1. **Worktree safety.** Refuse on `main`. Offer to create a new named branch via `cccl-clarify`.
+2. **Apply edits.** Per-file approval via `cccl-clarify`. Offer `gh issue create` for any deferred problems.
+3. **Override matrix + skip tags.** Dispatch `cccl-ci-overrides` with `failed_jobs:` (TSV path), `paths:` (edited
+   files), `for_workflow: nightly`. Reference the nightly run ID in the override comment; skip tags apply to the
+   LAST commit only.
+4. **Commit.** Route to `cccl-commit`.
+5. **Open PR + `/ok to test`.** Route to `cccl-pr` Phase 1. PR body should reference the nightly run + per-cluster
+   diagnosis. Multiple PRs → run Phase 1 once per branch, framed via `cccl-clarify`.
diff --git a/.agent/skills/cccl-triage-pr/SKILL.md b/.agent/skills/cccl-triage-pr/SKILL.md
new file mode 100644
index 00000000000..89ba4bde38a
--- /dev/null
+++ b/.agent/skills/cccl-triage-pr/SKILL.md
@@ -0,0 +1,85 @@
+---
+name: cccl-triage-pr
+description: "Diagnose and (optionally) fix CI failures on the current branch's open PR in the CCCL repository. Resolves the PR from the current branch, groups failed checks by likely root cause, pulls representative logs, summarizes them, presents findings, and — on user approval — applies fixes, adds override matrix + skip tags, commits, pushes, posts `/ok to test <SHA>`. Use when the user wants to investigate or fix CI failures on a PR. Trigger phrases: \"diagnose the PR\", \"fix CI on this PR\", \"what's failing in CI\", \"investigate this PR's CI\"."
+argument-hint: "[PR-number]"
+---
+
+# cccl-triage-pr
+
+Route user-question moments through `cccl-clarify`. Create the scratch dir once: `mkdir -p /tmp/claude/<sessionid>`.
+
+## Step 1 — Resolve PR
+
+User-supplied PR# wins. Otherwise:
+
+```
+gh pr view --json number,title,state,url,headRefName,isDraft,headRefOid > /tmp/claude/<sessionid>/pr_meta.json
+```
+
+Capture `number` and `headRefOid`.
+
+## Step 2 — Fetch failures
+
+Dispatch `cccl-fetch-ci-failures` with the PR number. The agent writes a TSV to a path you specify
+(`/tmp/claude/<sessionid>/failed_jobs.tsv`): `(job-id, name, grouping-hint)` per row.
+
+Zero failures → report and offer to wait. If waiting, schedule `ScheduleWakeup(delaySeconds=1200)`.
+
+## Step 3 — Group + pick representatives
+
+Bucket failures by shared axes (toolchain, library, variant, platform, phase). Pick one representative job ID (JID) per
+group. Don't fetch every failure's logs.
+
+## Step 4 — Pull representative logs
+
+For each representative:
+
+```
+gh api repos/NVIDIA/cccl/actions/jobs/<JID>/logs > /tmp/claude/<sessionid>/job_<JID>.log
+```
+
+Works mid-run, unlike `gh run view --log-failed`.
+
+## Step 5 — Summarize via `cccl-summarize-job-log`
+
+Dispatch one agent per log, in parallel. Each returns 5–10 lines.
+
+## Step 6 — Present + ask
+
+Compact table:
+
+```
+Group                              | Repr JID    | Likely cause             | Affected
+---------------------------------- | ----------- | ------------------------ | --------
+CTK13.2 GCC15 C++20 TestNoLaunch   | 74849038365 | infra: artifact download | 1
+CTK12.0 GCC8 C++17 CUB Build       | 7484903xxxx | -Wunused-but-set-param   | 8
+```
+
+Route through `cccl-clarify` to ask which groups to dig into.
+
+## Step 7 — Diagnose accepted groups
+
+Re-read representative logs; cross-reference repo code where the error names a file or function. Present:
+
+1. **What broke** — concrete error.
+2. **Why** — root-cause hypothesis.
+3. **Suggested fix** — concrete change, "rerun — transient infra", or "needs upstream report".
+4. **Confidence** — high/medium/low + one-line reason.
+
+For infra-only failures, suggest `gh run rerun <RUN_ID> --failed`.
+
+## Step 8 — Ship the fix
+
+1. **Worktree safety.** Refuse on `main`.
+2. **Apply edits.** Per-file approval via `cccl-clarify`.
+3. **Override matrix + skip tags.** Dispatch `cccl-ci-overrides` with `failed_jobs:` (TSV path) + `paths:` (edited
+   files). Offer the YAML and tag set via `cccl-clarify`. Skip tags apply to the LAST commit only.
+4. **Commit.** Route to `cccl-commit`.
+5. **Push + `/ok to test`.** Route to `cccl-pr` Phase 4.
+
+## Pitfalls
+
+- `gh pr checks` exits 1 when any check failed — expected.
+- Avoid `gh pr view --json statusCheckRollup` — 100k+ tokens for 500-job PRs.
+- Avoid `gh run view --log-failed` mid-run; use `gh api .../jobs/<JID>/logs` instead.
+- Don't fetch every failure's logs; pull one representative per group.
diff --git a/.agent/skills/cccl/SKILL.md b/.agent/skills/cccl/SKILL.md
new file mode 100644
index 00000000000..a2a229c7c07
--- /dev/null
+++ b/.agent/skills/cccl/SKILL.md
@@ -0,0 +1,39 @@
+---
+name: cccl
+description: "Entry-point orientation for the CCCL repository. Surfaces the available CCCL-specific skills and agents and points at common entry phrases. Load this skill first in every CCCL session before doing other work. Use when starting any task in this repo, when unsure which CCCL skill to use, or when introduced to the repo cold."
+---
+
+# cccl
+
+Skills live in `.agent/skills/`; agents live in `.agent/agents/`. `.claude/skills` and `.claude/agents` symlink to
+those so Claude Code and Codex find the same files.
+
+If you don't know how skill or agent invocation works, load `cccl-agent-impl` first.
+
+## Where to start by intent
+
+| Intent                                         | Load                                                |
+|------------------------------------------------|-----------------------------------------------------|
+| Commit uncommitted changes / wrap up a fix     | `cccl-commit`                                       |
+| Resplit / clean up a branch's commit history   | `cccl-resplit-branch`                               |
+| Open / edit / comment on a PR / trigger CI     | `cccl-pr`                                           |
+| Diagnose CI on this PR / what's failing        | `cccl-triage-pr`                                    |
+| Triage nightly / fix nightly CI                | `cccl-triage-nightly`                               |
+| Stuck on a decision / should I X or Y          | `cccl-clarify`                                      |
+| Post `/ok to test`                             | `cccl-ok-to-test` agent (called by a skill)         |
+| Generate override matrix / skip tags           | `cccl-ci-overrides` agent (called by a skill)       |
+| How does CI work / where is X CI defined       | `cccl-ci`                                           |
+| Set up a benchmark on this PR                  | `cccl-ci-benchmarks`                                |
+| Git bisect a regression                        | `cccl-bisect`                                       |
+| Build / test in the devcontainer               | `cccl-devcontainers`, `cccl-build-and-test-targets` |
+| Build cub / thrust / libcudacxx / cudax (full) | `cccl-cpp-builds`                                   |
+| Work on / build / test the python bindings     | `cccl-python`                                       |
+| Check for SASS/PTX changes                     | `cccl-sass-diff`                                    |
+| libcudacxx code style                          | `cccl-libcudacxx-style`                             |
+
+## Repo conventions
+
+- **Scratch space**: `/tmp/claude/<sessionid>/`. Create with `mkdir -p`. Don't pipe; redirect to a file and Read.
+- **CI** uses `ci/matrix.yaml` with optional `workflows.override` to scope PR jobs; `[skip-*]` commit tags scope
+  further. Both block merging while present. See `cccl-ci`.
+- **`/ok to test <SHA>`** is required from a maintainer for external PRs. The `cccl-ok-to-test` agent posts it.
diff --git a/.claude/agents b/.claude/agents
new file mode 120000
index 00000000000..7f6880019d4
--- /dev/null
+++ b/.claude/agents
@@ -0,0 +1 @@
+../.agent/agents
\ No newline at end of file
diff --git a/.claude/settings.json b/.claude/settings.json
new file mode 100644
index 00000000000..5e73bf22873
--- /dev/null
+++ b/.claude/settings.json
@@ -0,0 +1,69 @@
+{
+  "$schema": "https://json.schemastore.org/claude-code-settings.json",
+  "permissions": {
+    "additionalDirectories": [
+      "/tmp/claude"
+    ],
+    "allow": [
+      "Bash(gh auth status)",
+      "Bash(gh pr view *)",
+      "Bash(gh pr checks *)",
+      "Bash(gh pr list *)",
+      "Bash(gh pr diff *)",
+      "Bash(gh run view *)",
+      "Bash(gh run list *)",
+      "Bash(gh workflow list *)",
+      "Bash(gh workflow view *)",
+      "Bash(gh issue view *)",
+      "Bash(gh issue list *)",
+      "Bash(gh search *)",
+      "Bash(gh api repos/NVIDIA/cccl/actions/jobs/*)",
+      "Bash(gh api repos/NVIDIA/cccl/actions/runs/*/jobs*)",
+      "Bash(git status)",
+      "Bash(git status *)",
+      "Bash(git log *)",
+      "Bash(git diff *)",
+      "Bash(git show *)",
+      "Bash(git blame *)",
+      "Bash(git rev-parse *)",
+      "Bash(git branch --show-current)",
+      "Bash(git branch -a)",
+      "Bash(git branch -v)",
+      "Bash(git branch --list *)",
+      "Bash(git remote)",
+      "Bash(git remote -v)",
+      "Bash(git remote get-url *)",
+      "Bash(git ls-files *)",
+      "Bash(git worktree list)",
+      "Bash(git worktree list *)",
+      "Bash(git check-ignore *)",
+      "Bash(rg *)",
+      "Bash(grep *)",
+      "Bash(jq *)",
+      "Bash(sed -n *)",
+      "Bash(ls)",
+      "Bash(ls *)",
+      "Bash(pwd)",
+      "Bash(cat *)",
+      "Bash(head *)",
+      "Bash(tail *)",
+      "Bash(wc *)",
+      "Bash(file *)",
+      "Bash(stat *)",
+      "Bash(command -v *)",
+      "Bash(mkdir -p /tmp/claude/*)"
+    ]
+  },
+  "hooks": {
+    "SessionStart": [
+      {
+        "hooks": [
+          {
+            "type": "command",
+            "command": "printf '%s' '{\"hookSpecificOutput\":{\"hookEventName\":\"SessionStart\",\"additionalContext\":\"CCCL repo orientation: load the `cccl` skill via the Skill tool BEFORE doing anything else. It is the entry point that surfaces all repo-local skills (.claude/skills -> .agent/skills) and agents (.claude/agents -> .agent/agents). If you are unfamiliar with how skills/agents work in this repo, load `cccl-agent-impl` first.\"}}'"
+          }
+        ]
+      }
+    ]
+  }
+}
diff --git a/.claude/skills b/.claude/skills
new file mode 120000
index 00000000000..9b058317d13
--- /dev/null
+++ b/.claude/skills
@@ -0,0 +1 @@
+../.agent/skills
\ No newline at end of file
diff --git a/.claude/skills/libcudacxx-style/SKILL.md b/.claude/skills/libcudacxx-style/SKILL.md
deleted file mode 100644
index 7e2aa7a3fee..00000000000
--- a/.claude/skills/libcudacxx-style/SKILL.md
+++ /dev/null
@@ -1,6 +0,0 @@
----
-name: libcudacxx-style
-description: Make the code in libcudacxx/include, cudax/include compliant with the coding style
----
-
-The skill content is in .agent/skills/libcudacxx-style/SKILL.md
diff --git a/AGENTS.md b/AGENTS.md
index 240568675a0..f8ba92d65b3 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,453 +1,86 @@
 # Agent Instructions
 
-This document provides guidelines for building, testing, and contributing to the CCCL repository. It is primarily written for agentic AIs, but the information is also useful for CCCL developers.
+## Load the `cccl` skill first
 
----
+Load the `cccl` skill via the Skill tool. It maps the available repo-local skills and agents and routes by user intent.
 
-## Overview
+If you don't know what skills or agents are, load `cccl-agent-impl` first.
 
-CCCL is a collection of CUDA C++ libraries and Python packages:
+## What CCCL is
 
-* **libcudacxx** — CUDA C++ Standard Library
-* **CUB** — Block-level primitives
-* **Thrust** — High-level parallel algorithms
-* **cudax** — Experimental features
-* **C Parallel Library** — C bindings for CCCL algorithms
-* **Python CCCL packages** (`cuda-cccl`) — Python bindings for parallel and cooperative primitives
+CCCL — the CUDA Core Compute Libraries — is a collection of CUDA C++ libraries and Python packages:
 
-The repository uses **CMake** with the **Ninja** generator and provides standardized presets for consistent builds.
+- **libcudacxx** — CUDA C++ Standard Library
+- **CUB** — Block-level primitives
+- **Thrust** — High-level parallel algorithms
+- **cudax** — Experimental features
+- **C Parallel Library** — C bindings for CCCL algorithms
+- **cuda-cccl Python packages** — Python bindings for parallel + cooperative primitives
 
----
+Built with CMake + Ninja via the presets in `CMakePresets.json`.
 
-## Iteration Cycles
-
-For a given task, you should:
-
-1. Research. Search the web, read existing code, look up system/dependency headers / implementations of related functionality. Figure out best practices and common pitfalls. Look for existing tests of the functionality; if none exist, plan a new test that integrates with the relevant existing testing frameworks.
-2. Plan. Create a high-level plan to implement the requested feature.
-3. Review and Refine plan. Look for pitfalls, find ways to smooth out rough edges. Verify any assumptions, edgecases, or identified pitfalls. Repeat until the plan is solid.
-4. Gather consistency context. Look at similar code (sibling classes if possible, otherwise just related source files) to learn the style and patterns used in the project. Consistency is important -- similar features should be organized and implemented similarly. Naming conventions should be followed.
-5. If requested: Present the plan. Only do this if the user asks for a plan to do something -- if they just ask you implement something without requesting a plan, skip this step.
-6. Draft. Implement the requested task to the best of your ability.
-7. Review and Refine. Read through your changes. Verify that API calls are correct. Assess clarity, performance, and readability. Iterate as needed.
-8. Style check. Ensure that your changes follow style and naming conventions.
-9. Build and test. Once you're confident that your changes are functionally and stylistically correct start build, test, and iterate cycles. If you don't have permissions to do these, ask the user to run specific build/test commands for you.
-
----
-
-## Known Agent Limitations
-
-### OpenAI Codex
-
-Codex cloud instances cannot:
-
-* Run Docker containers with devcontainer scripts
-* Access GPUs or run GPU-dependent tests
-
----
-
-## Build and Test Tools
-
-All CCCL subprojects are computationally expensive to build and test. Use the provided helper scripts to minimize work and target only what you need.
-
-### CMake Presets
-
-Presets are defined in `CMakePresets.json`. Names follow a `project` or `<project>-cpp<std>` format, such as `cub-cpp20`, `thrust-cpp17`, or `libcudacxx`. Use `cmake --list-presets` to view available options. Build trees are placed under `build/${CCCL_BUILD_INFIX}/${PRESET}`.
-
-### `.devcontainer/launch.sh`
-
-Launches a container configured with a CUDA Toolkit and host compiler. First startup may take time, but cached environments are faster. In agent environments, container launches may not be supported. To check if you are already inside a container, verify if `CCCL_BUILD_INFIX` is set.
-
-Common options:
-
-* `-d, --docker` — Run without VSCode (required for agents)
-* `--cuda <version>` — Select CUDA Toolkit (optional)
-* `--cuda-ext` — Use a docker image with extended CTK libraries
-* `--host <compiler>` — Select host compiler (optional)
-* `--gpus <request>` — GPU devices to add to the container (use `all` to pass all GPUs)
-* `-e/--env`, `-v/--volume` — Environment variables / volume mounts
-* `-- <script>` — Run script inside container after setup
-
-Example:
-
-```bash
-.devcontainer/launch.sh -d --cuda 13.2 --host gcc14 -- <script> [args...]
-```
-
-### `ci/util/build_and_test_targets.sh`
-
-Configures, builds, and tests selected Ninja, CTest, or lit targets. Many tests require GPUs. Options that generally work without GPUs include `--preset`, `--cmake-options`, `--configure-override`, `--build-targets`, `--lit-precompile-tests`, and `--custom-test-cmd`.
-
-Key options:
-
-* `--preset <name>` — Use a CMake preset
-* `--cmake-options <str>` — Extra CMake arguments
-* `--configure-override <cmd>` — Custom configuration command
-* `--build-targets "<targets>"` — Space-separated Ninja targets
-* `--ctest-targets "<regex>"` — Regex for CTest targets (may fail without GPUs)
-* `--lit-precompile-tests "<paths>"` — Precompile specified libcudacxx lit tests (paths are relative to `libcudacxx/test/libcudacxx/`)
-* `--lit-tests "<paths>"` — Run specified libcudacxx lit tests (also relative to `libcudacxx/test/libcudacxx/`)
-* `--custom-test-cmd "<cmd>"` — Run arbitrary command after tests
-
-### `ci/util/git_bisect.sh`
-
-Wraps `git bisect` with the build/test helper. Useful for identifying regression commits. Can take a very long time—minimize scope by restricting build/test targets.
-
-Extra options:
-
-* `--good-ref <rev>` — Known good commit/tag, or `-Nd` for origin/main N days ago (default: latest release)
-* `--bad-ref <rev>` — Known bad commit/tag, or `-Nd` (default: origin/main)
-
-See `docs/cccl/development/build_and_bisect_tools.rst` for details.
-
----
-
-## Building and Testing
-
-Always prefer targeted builds and tests, as full builds are time-consuming. If required tools or hardware are unavailable, note this in the PR but run as many relevant tests as possible.
-
-### Targeted Build and Test Examples
-
-* **CUB** (`cub/`):
-
-```bash
-ci/util/build_and_test_targets.sh \
-  --preset cub-cpp20 \
-  --build-targets "cub.cpp20.test.iterator" \
-  --ctest-targets "cub.cpp20.test.iterator"
-```
-
-* **Thrust** (`thrust/`):
-
-```bash
-ci/util/build_and_test_targets.sh \
-  --preset thrust-cpp20 \
-  --build-targets "thrust.cpp20.test.reduce" \
-  --ctest-targets "thrust.cpp20.test.reduce"
-```
-
-* **libcudacxx** (`libcudacxx/`):
-  Avoid the expensive `libcudacxx.cpp20.precompile.lit`. Instead, precompile and run a small set of lit tests:
-
-```bash
-ci/util/build_and_test_targets.sh \
-  --preset libcudacxx \
-  --lit-precompile-tests "std/algorithms/alg.nonmodifying/alg.any_of/any_of.pass.cpp" \
-  --lit-tests "std/algorithms/alg.nonmodifying/alg.any_of/any_of.pass.cpp"
-```
-
-* **CUDA Experimental** (`cudax/`):
-
-```bash
-ci/util/build_and_test_targets.sh \
-  --preset cudax \
-  --build-targets "cudax.cpp20.test.async_buffer" \
-  --ctest-targets "cudax.cpp20.test.async_buffer"
-```
-
-* **C Parallel API** (`c/parallel/`):
-
-```bash
-ci/util/build_and_test_targets.sh \
-  --preset cccl-c-parallel \
-  --build-targets "cccl.c.test.reduce" \
-  --ctest-targets "cccl.c.test.reduce"
-```
-
-### Full Builds
-
-> ⚠️ **Important:** Full builds are costly. Always allow 60+ minutes for builds and 30+ minutes for tests. Do not cancel once started.
-
-Use scripts like:
-
-```bash
-./ci/build_cub.sh [-cxx g++] [-std 17] [-arch "75;80;90;120"]
-./ci/build_thrust.sh [-cxx clang++] [-std 17] [-arch "75;80;90;120"]
-./ci/build_libcudacxx.sh [-cxx g++] [-std 17] [-arch "75;80;90;120"]
-./ci/build_cudax.sh [-cxx g++] [-std 20] [-arch "75;80;90;120"]
-./ci/build_cccl_c_parallel.sh [-cxx g++] [-std 17] [-arch "75;80;90;120"]
-./ci/build_cuda_cccl_python.sh -py-version 3.10
-```
-
-### Architectures
-
-* `<XX>` — Generate PTX and SASS
-* `<XX-real>` — Generate only SASS
-* `<XX-virtual>` — Generate only PTX
-* `native` — Detect host GPU
-* `all-major-cccl` — Default for PR builds
-
-### Testing
-
-> ⚠️ Requires an NVIDIA GPU. Tests take 15+ minutes. Use targeted testing whenever possible.
-
-Examples:
-
-```bash
-./ci/test_cub.sh -cxx g++ -std 17 -arch "75;80;90;120"
-./ci/test_thrust.sh -cxx g++ -std 17 -arch "75;80;90;120"
-./ci/test_libcudacxx.sh -cxx g++ -std 17 -arch "75;80;90;120"
-./ci/test_cudax.sh -cxx g++ -std 20 -arch "75;80;90;120"
-ctest --preset=cub-cpp17
-```
-
-Options:
-
-* `-compute-sanitizer-memcheck` — Run with memory checking or other compute-sanitizer tools (not all projects support this)
-
----
-
-## Python CCCL Packages
-
-Python components require different parameters than C++ builds. Use `-py-version` instead of compiler flags.
-
-Supported versions: `3.10`, `3.11`, `3.12`, `3.13`
-
-### Modules
-
-* **cuda.compute** — Device-level algorithms, iterators, custom GPU types
-* **cuda.coop._experimental** — Block/warp-level primitives for Numba CUDA
-* **cuda.cccl.headers** — Programmatic access to headers
-
-### Installation
-
-From PyPI:
-
-```bash
-pip install cuda-cccl[cu13] # or [cu12] for CTK 12.X
-```
-
-From conda-forge:
-
-```bash
-conda install -c conda-forge cccl-python
-```
-
-From source:
-
-```bash
-git clone https://github.com/NVIDIA/cccl.git
-cd cccl/python/cuda_cccl
-pip install -e .[test-cu13] # or [test-cu12] for CTK 12.X
-```
-
-Requirements:
-
-* Python 3.10+
-* CUDA Toolkit 12.x or 13.x
-* NVIDIA GPU (CC 6.0+)
-* Base dependencies: `numba>=0.60.0`, `numpy`, `cuda-pathfinder>=1.2.3`, `cuda-core`, `typing_extensions`
-* CUDA extras: `cuda-bindings` + `cuda-toolkit` + `numba-cuda` via `cuda-cccl[cu12]` or `cuda-cccl[cu13]`
-
-### Usage Examples
-
-```python
-import cuda.compute
-result = cuda.compute.reduce_into(input_array, output_scalar, init_val, binary_op)
-
-import cuda.coop._experimental as coop
-@cuda.jit
-def kernel(data):
-    coop.block.reduce(data, binary_op)
-
-import cuda.cccl.headers as headers
-include_paths = headers.get_include_paths()
-```
-
-### Build and Test
-
-```bash
-./ci/build_cuda_cccl_python.sh -py-version 3.10
-./ci/test_cuda_compute_python.sh -py-version 3.10
-./ci/test_cuda_coop_python.sh -py-version 3.10
-./ci/test_cuda_cccl_headers_python.sh -py-version 3.10
-./ci/test_cuda_cccl_examples_python.sh -py-version 3.10
-```
-
-Test organization:
-
-* `tests/compute` — Algorithms and iterators
-* `tests/coop` — Cooperative primitives
-* `tests/headers` — Header integration
-* `test_examples.py` — Runs compute/coop examples
-
----
-
-## SASS Diffs
-
-Use this test when asked to check for SASS changes between commits, branches or a local changeset.
-
-### Goal
-
-Detect relevant changes in generated CUDA machine code (i.e. SASS) while filtering noise from addresses, symbols, metadata, etc.
-Any non-trivial change must be detected.
-
-### Inputs to establish
-
-* Compiled binary under test
-* The CUDA SM architectures to compile for. Try to detect this from the code and offer the user a list of suggestions.
-  The user must conform or provide this list.
-* Baseline disassembly (from the previous commit/branch or the current commit without the changes in the working copy).
-* Comparison disassembly (form the current commit/branch or the current commit with the changes in the working copy).
-* By default, prefer `cuobjdump -sass` to inspect SASS changes.
-  Use `cuobjdump -ptx` if the request is to check for PTX changes instead.
-
-### Normalization rules (strip known noise)
-
-Apply these transforms to both baseline and candidate listings before diffing.
-Write the normalized listings to separate files.
-
-* Remove addresses/offsets/hex location prefixes.
-* Remove build IDs, timestamps, absolute paths, temp directories, and compiler banners.
-* Normalize whitespace and alignment to single spaces.
-* Remove empty lines and purely comment lines.
-
-### Comparison rules (what matters)
-
-Ignore as trivial:
-
-* Register renaming with identical instruction sequence and operands.
-* Pure label renumbering or reordering of identical basic blocks.
-* Formatting-only differences or reordered symbol tables.
-
-### Reporting
-
-* If any non-trivial change was detected, the top 5 regions where a non-trivial change was detected,
-  including the name of the kernel they appeared in.
-* A short summary of the diff type (opcode change, memory access size change, size delta, control-flow, etc.).
-* Explicitly state if only noise was detected after normalization.
-* If you are not sure if the differences are impactful, show it and ask the user for guidance.
-* Keep the disassembly dumps available for reference and show the command to the user to generate a diff.
-
----
-
-## Continuous Integration (CI)
-
-See `ci-overview.md` for detailed examples and troubleshooting guidance.
-
-CCCL's CI is built on GitHub Actions and relies on a dynamically generated job matrix plus several helper scripts.
-
-### Key Components
-
-* **`ci/matrix.yaml`**
-
-  * Declares build and test jobs for `pull_request`, `nightly`, and `weekly` workflows.
-  * Pull request (PR) runs typically spawn ~250 jobs.
-  * To reduce overhead, you can add an override matrix in `workflows.override`. This limits the PR CI run to a targeted subset of jobs. Overrides are recommended when:
-    * Changes touch high-dependency areas (e.g. top-level CI/devcontainers, libcudacxx, thrust, CUB). See `ci/inspect_changes.py` for dependency information.
-    * A smaller subset of jobs is enough to validate the change (e.g. infra changes, targeted fixes).
-  * Important rules:
-    * PR merges are blocked while an override matrix is active.
-    * The override must be reset to empty (not removed) before merging.
-    * Only add overrides when starting a new draft that qualifies; never remove one without being asked.
-
-* **`.github/actions/workflow-build/`**
-
-  * Runs `build-workflow.py`.
-  * Reads `ci/matrix.yaml` and prunes jobs using `ci/inspect_changes.py`.
-  * Calls `prepare-workflow-dispatch.py` to produce a formatted job matrix for dispatch.
-
-* **`.github/actions/workflow-run-job-{linux,windows}/`**
-
-  * Runs a single matrix job inside a devcontainer.
-
-* **`.github/actions/workflow-results/`**
-
-  * Aggregates artifacts and results.
-  * Marks workflow as failed if any job fails or an override matrix is present.
-
-* **`.github/workflows/ci-workflow-{pull-request,nightly,weekly}.yml`**
-
-  * Top-level GitHub Actions workflows invoking CI.
-
-* **`ci/inspect_changes.py`**
-
-  * Detects which subprojects changed between commits.
-  * Defines internal dependencies between CCCL projects. If a project is marked dirty, all dependent projects are also marked dirty and tested.
-  * Allows `build-workflow.py` to skip unaffected jobs.
-
----
-
-### Commit Message Controls
-
-Tags appended to the commit summary (case-sensitive) control CI behavior:
-
-* `[bench-only]`: Skip all non-benchmark CI jobs. Equivalent to `[skip-matrix][skip-vdc][skip-docs][skip-tpt]`.
-* `[skip-matrix]`: Skip CCCL project build/test jobs. (Docs, devcontainers, and third-party builds still run.)
-* `[skip-vdc]`: Skip "Verify Devcontainer" jobs. Safe unless CI or devcontainer infra is modified.
-* `[skip-docs]`: Skip doc tests/previews. Safe if docs are unaffected.
-* `[skip-third-party-testing]` / `[skip-tpt]`: Skip third-party smoke tests (MatX, PyTorch, RAPIDS).
-* `[skip-matx]`: Skip building the MatX third-party smoke test.
-* `[skip-pytorch]`: Skip building the PyTorch third-party smoke test.
-* `[skip-rapids]`: Skip building the RAPIDS third-party smoke test.
-
-> ⚠️ All of these tags block merging until removed and a full CI run (with no overrides) succeeds.
-
-Use these tags for early iterations to save resources. Remove them before review/merge.
-
----
-
-## Code Formatting and Linting
-
-> ⚠️ Always run before committing. CI will fail otherwise.
-
-```bash
-pip install pre-commit
-pre-commit install
-pre-commit run --all-files
-pre-commit run --files <file1> <file2>
-```
-
----
-
-## General Guidelines
-
-* Validate changes with builds/tests; report results.
-* Run `pre-commit` before committing.
-* Review `CONTRIBUTING.md` and `ci-overview.md` before starting work.
-
-### Performance Tips
-
-* Use development containers with `sccache` (CCCL team only).
-* Limit architectures to reduce compile time (e.g. `-arch "native"` or `"80"` if no GPU).
-* Build with Ninja for fast, parallel builds.
-
-
----
-
-## Repository Structure
+## Repository layout
 
 ```
 cccl/
-├── .github/            # Workflows
-├── .devcontainer/      # Dev containers
-├── libcudacxx/         # CUDA C++ Standard Library
-├── cub/                # CUB primitives
-├── thrust/             # Thrust algorithms
-├── cudax/              # Experimental features
-├── c/                  # C Parallel library
-├── python/cuda_cccl/   # Python bindings
-├── ci/                 # Build/test scripts
-├── examples/           # Usage examples
-└── CMakePresets.json   # Preset configurations
-```
-
-Python package layout:
-
-```
-python/cuda_cccl/
-├── cuda/
-│   ├── compute/
-│   ├── coop/
-│   └── cccl/
-│       ├── parallel/
-│       ├── cooperative/
-│       └── headers/
-├── tests/
-├── benchmarks/
-└── pyproject.toml
-```
-
----
-
-⚠️ **Reminder:** Long-running builds/tests are normal. Never cancel them; allow to complete.
+├── .agent/skills/      <- canonical skills (one dir per skill)
+├── .agent/agents/      <- canonical agents (one file per agent)
+├── .claude/skills      -> ../.agent/skills (directory symlink)
+├── .claude/agents      -> ../.agent/agents (directory symlink)
+├── .claude/settings.json
+├── .devcontainer/      <- Docker containers for reproducible builds
+├── .github/            <- workflows, copy-pr-bot
+├── libcudacxx/         <- CUDA C++ Standard Library
+├── cub/                <- CUB primitives
+├── thrust/             <- Thrust algorithms
+├── cudax/              <- experimental features
+├── c/                  <- C Parallel library
+├── python/cuda_cccl/   <- Python bindings
+├── ci/                 <- build/test scripts + matrix.yaml
+├── docs/               <- Sphinx documentation source
+├── examples/           <- usage examples
+├── AGENTS.md           <- this file
+├── CLAUDE.md           -> AGENTS.md (symlink)
+└── CMakePresets.json
+```
+
+`.agent/` is canonical; `.claude/skills` and `.claude/agents` symlink to it so both Claude Code and Codex find the
+same files.
+
+## Skill routing
+
+The `cccl` skill carries the full table. Common entries:
+
+- Commit uncommitted changes / wrap up a fix → `cccl-commit`
+- Resplit / clean up a branch's commit history → `cccl-resplit-branch`
+- Open / edit / comment on a PR / trigger CI → `cccl-pr`
+- CI overview / matrix / skip tags / `/ok to test` → `cccl-ci`
+- Triage failed CI → `cccl-triage-pr` or `cccl-triage-nightly`
+- Benchmarks → `cccl-ci-benchmarks`
+- Git bisect → `cccl-bisect`
+- Devcontainers → `cccl-devcontainers`
+- Targeted build/test (fast iteration) → `cccl-build-and-test-targets`
+- Full-matrix C++ build/test scripts → `cccl-cpp-builds`
+- Python packages (cuda-cccl) → `cccl-python`
+- libcudacxx code style → `cccl-libcudacxx-style`
+- SASS / PTX comparison → `cccl-sass-diff`
+- Stuck on a decision → `cccl-clarify`
+
+Reference docs: `CONTRIBUTING.md`, `ci-overview.md`, `docs/cccl/development/`.
+
+## Known agent limitations
+
+- Long-running builds (60+ min) and tests (30+ min) are normal — never cancel them. Use
+  `cccl-build-and-test-targets` for fast iteration.
+
+## Pre-commit
+
+Run `pre-commit run --files <files>` before committing. CI's linters will fail otherwise.
+
+## Reset before final merge
+
+These files block merging in their non-default state:
+
+- Non-empty `workflows.override` in `ci/matrix.yaml` → reset to empty.
+- `[skip-*]` tags in the last commit message → remove and re-push.
+- Modified `ci/bench.yaml` → restore to match `ci/bench.template.yaml`.
diff --git a/CLAUDE.md b/CLAUDE.md
deleted file mode 100644
index d3dc773962b..00000000000
--- a/CLAUDE.md
+++ /dev/null
@@ -1,6 +0,0 @@
-# Claude Code Instructions
-
-This repository uses `AGENTS.md` files for agent instructions.
-Before doing anything else, find and index all `AGENTS.md` files in this repository.
-Read them before working on any adjacent code or systems.
-All instructions in `AGENTS.md` files take precedence over any defaults. Follow them exactly.
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 120000
index 00000000000..47dc3e3d863
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1 @@
+AGENTS.md
\ No newline at end of file

From ab6256b158386fe9c4fff39aa23996917007445b Mon Sep 17 00:00:00 2001
From: Allison Piper <alliepiper16@gmail.com>
Date: Tue, 12 May 2026 15:53:41 -0400
Subject: [PATCH 2/7] Ignore .venv/

Generated when the agent venv-installs pre-commit per AGENTS.md's
"Pre-commit" section. Untracked venvs noise up `git status` and
risk accidental staging.
---
 .gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitignore b/.gitignore
index 7c6803f0c62..0f2e6f72a2f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -16,6 +16,7 @@ CMakeUserPresets.json
 *.pyc
 __pycache__
 *.pyd
+.venv/
 wheelhouse/
 bench-artifacts/
 CLAUDE.local.md

From 2fec3fd674f7ce5fb38bae321901690188430d88 Mon Sep 17 00:00:00 2001
From: Allison Piper <alliepiper16@gmail.com>
Date: Tue, 12 May 2026 15:54:06 -0400
Subject: [PATCH 3/7] Document pre-commit auto-fix re-staging in cccl-commit

Pre-commit hooks like pretty-format-json, end-of-file-fixer,
trim-trailing-whitespace, and ruff format rewrite files in place.
On failure with auto-fixes applied, the skill now routes each
fixed file through cccl-clarify (re-stage / revert / discuss) -
the same flow as the per-chunk action menu - rather than
bulk-staging the fixes. Also notes the venv-install fallback for
when pre-commit is absent from the host.
---
 .agent/skills/cccl-commit/SKILL.md | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/.agent/skills/cccl-commit/SKILL.md b/.agent/skills/cccl-commit/SKILL.md
index fa39252d9ce..5cf5f517b73 100644
--- a/.agent/skills/cccl-commit/SKILL.md
+++ b/.agent/skills/cccl-commit/SKILL.md
@@ -70,8 +70,17 @@ skip the test gate unless asked, go to 5.2.
 
 ### 5.1 Tests
 
-`cccl-clarify` → skip / `pre-commit run --files <staged>` / dispatch `cccl-build-and-test-targets`. On failure:
-investigate / commit anyway / abort.
+`cccl-clarify` → skip / `pre-commit run --files <staged>` / dispatch `cccl-build-and-test-targets`. If
+`pre-commit` is absent, venv-install it (`python3 -m venv .venv && .venv/bin/pip install pre-commit`).
+
+Many pre-commit hooks auto-fix in place (`pretty-format-json`, `end-of-file-fixer`,
+`trim-trailing-whitespace`, `ruff format`). On failure with auto-fixes applied:
+1. Show the resulting `git diff` per fixed file.
+2. For each file, route through `cccl-clarify` — re-stage / revert / discuss — same flow as Step 4's per-chunk
+   action menu. Never bulk-`git add` the fixes.
+3. Re-run `pre-commit run --files <staged>` to confirm clean.
+
+Other failures: investigate / commit anyway / abort via `cccl-clarify`.
 
 ### 5.2 Commit message
 

From 434c795aacaea633f6a9be56df84332597a22995 Mon Sep 17 00:00:00 2001
From: Allison Piper <alliepiper16@gmail.com>
Date: Tue, 12 May 2026 16:09:29 -0400
Subject: [PATCH 4/7] Add CI-scoping reminders to cccl-commit and cccl-pr

---
 .agent/skills/cccl-commit/SKILL.md | 11 ++++++++---
 .agent/skills/cccl-pr/SKILL.md     |  4 ++++
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/.agent/skills/cccl-commit/SKILL.md b/.agent/skills/cccl-commit/SKILL.md
index 5cf5f517b73..388186ee2de 100644
--- a/.agent/skills/cccl-commit/SKILL.md
+++ b/.agent/skills/cccl-commit/SKILL.md
@@ -68,6 +68,13 @@ the per-group expected set. STOP on divergence.
 Commit-only with no Split / no Interactive: confirm staged set via `git diff --cached --stat` (empty → exit),
 skip the test gate unless asked, go to 5.2.
 
+### 5.0a Optional CI scoping (last commit only)
+
+Before drafting the last commit's message, route through `cccl-clarify`: offer to scope the next CI run via
+`cccl-ci-overrides` — override matrix (writes `workflows.override` into `ci/matrix.yaml`; re-stage + re-run
+pre-commit) and/or `[skip-*]` tags on the last commit's last line. Both block merge — remind the user to reset
+before final merge.
+
 ### 5.1 Tests
 
 `cccl-clarify` → skip / `pre-commit run --files <staged>` / dispatch `cccl-build-and-test-targets`. If
@@ -106,9 +113,7 @@ prompt). Verify with `git show -p HEAD`: SHA, subject, file list match expectati
 After each commit, `cccl-clarify` → continue / pause / end. On continue, verify remaining slices still apply
 (`git apply --check` per remaining slice); regenerate the patch and re-plan if any fail.
 
-Remind caller to use `cccl-ci-overrides` to setup a minimal CI run if needed.
-
-Last group → final summary (all SHAs, deferred, reverted) and exit.
+Last group → final summary (all SHAs, deferred, reverted) and exit. (CI scoping was offered in Step 5.0a.)
 
 ## Hard prohibitions
 
diff --git a/.agent/skills/cccl-pr/SKILL.md b/.agent/skills/cccl-pr/SKILL.md
index cf1b6adca3b..7a39f6c8fc0 100644
--- a/.agent/skills/cccl-pr/SKILL.md
+++ b/.agent/skills/cccl-pr/SKILL.md
@@ -8,6 +8,10 @@ description: "Manage CCCL pull requests — open a new draft PR after commits la
 CCCL PR lifecycle. Route every user-facing question through `cccl-clarify`. Refuses on `main`. Never force-pushes;
 never deletes branches; never closes/merges PRs.
 
+**Merge-blocker check** — before every push or PR-open operation, detect non-empty `workflows.override` in
+`ci/matrix.yaml` and any `[skip-*]` tags on HEAD's commit message. Both block merge. Surface via `cccl-clarify`
+as a reminder — typically fine for in-progress work, but must be reset before final merge.
+
 ## Step 1 — Resolve mode
 
 `cccl-clarify` (or infer from phrasing):

From 11b0173326b8637091b86a26fdce0eb36b637517 Mon Sep 17 00:00:00 2001
From: Allison Piper <alliepiper16@gmail.com>
Date: Tue, 12 May 2026 16:18:10 -0400
Subject: [PATCH 5/7] Ignore .claude/ in CI dep classification

---
 ci/project_files_and_dependencies.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ci/project_files_and_dependencies.yaml b/ci/project_files_and_dependencies.yaml
index b7bd75510da..7372d59a031 100644
--- a/ci/project_files_and_dependencies.yaml
+++ b/ci/project_files_and_dependencies.yaml
@@ -225,6 +225,7 @@ ignore_regexes:
   # changes.
   #
   # - '\.clang-tidy'
+  - '\.claude/'
   - '\.devcontainer/img'
   - '\.git-blame-ignore-revs'
   - '\.github/actions/docs-build'

From a539995e388103bf54e44b65d60de4d5d980dbf7 Mon Sep 17 00:00:00 2001
From: Allison Piper <alliepiper16@gmail.com>
Date: Thu, 14 May 2026 10:58:35 -0400
Subject: [PATCH 6/7] Mk II

---
 .agent/agents/cccl-ci-fetch-failures.md       |  92 ++++++++++++
 .agent/agents/cccl-ci-overrides.md            | 132 +++++++----------
 .agent/agents/cccl-ci-summarize-job-log.md    | 103 +++++++++++++
 .agent/agents/cccl-fetch-ci-failures.md       |  53 -------
 .agent/agents/cccl-ok-to-test.md              |  53 -------
 .agent/agents/cccl-summarize-job-log.md       |  59 --------
 .agent/skills/cccl-agent-impl/SKILL.md        |  54 -------
 .agent/skills/cccl-bench/SKILL.md             |  63 ++++++++
 .../cccl-bench/references/ci-bench-request.md |  75 ++++++++++
 .agent/skills/cccl-bench/references/docs.md   |  13 ++
 .../skills/cccl-bench/references/local-run.md |  73 +++++++++
 .../cccl-bench/references/nvbench-template.md | 109 ++++++++++++++
 .agent/skills/cccl-bench/references/tools.md  |  17 +++
 .agent/skills/cccl-bench/references/tuning.md |  58 ++++++++
 .agent/skills/cccl-bisect/SKILL.md            |  35 +++--
 .agent/skills/cccl-bisect/references/docs.md  |  11 ++
 .../references/git_bisect_usage.md            | 111 ++++++++++++++
 .agent/skills/cccl-bisect/references/tools.md |  14 ++
 .../cccl-build-and-test-targets/SKILL.md      |  73 ---------
 .agent/skills/cccl-build/SKILL.md             |  85 +++++++++++
 .../skills/cccl-build/references/arch-flag.md |  22 +++
 .../build_and_test_targets_usage.md           | 100 +++++++++++++
 .../references/build_common.sh_usage.md       |  98 ++++++++++++
 .agent/skills/cccl-build/references/docs.md   |  21 +++
 .agent/skills/cccl-build/references/tools.md  |  31 ++++
 .agent/skills/cccl-c/SKILL.md                 | 102 +++++++++++++
 .agent/skills/cccl-c/references/tools.md      |  13 ++
 .agent/skills/cccl-ci-benchmarks/SKILL.md     |  55 -------
 .agent/skills/cccl-ci/SKILL.md                |  62 +++++---
 .agent/skills/cccl-ci/references/docs.md      |  19 +++
 .agent/skills/cccl-ci/references/tools.md     |  24 +++
 .agent/skills/cccl-clarify/SKILL.md           |  42 +++---
 .../cccl-clarify/references/breakdown-flow.md |  43 ++++++
 .agent/skills/cccl-cmake/SKILL.md             | 109 ++++++++++++++
 .agent/skills/cccl-commit/SKILL.md            |  95 +++++-------
 .../references/commit-message-rules.md        |  47 ++++++
 .../references/pre-commit-autofix.md          |  44 ++++++
 .../references/walkthrough-rules.md           |  37 +++++
 .agent/skills/cccl-cpp-builds/SKILL.md        |  53 -------
 .agent/skills/cccl-cub/SKILL.md               | 121 +++++++++++++++
 .agent/skills/cccl-cub/references/docs.md     |  25 ++++
 .agent/skills/cccl-cub/references/tools.md    |   9 ++
 .../cccl-cub/references/tuning-policies.md    |  85 +++++++++++
 .agent/skills/cccl-cudax/SKILL.md             |  94 ++++++++++++
 .agent/skills/cccl-cudax/references/docs.md   |  21 +++
 .agent/skills/cccl-cudax/references/tools.md  |   9 ++
 .agent/skills/cccl-devcontainer/SKILL.md      |  52 +++++++
 .../cccl-devcontainer/references/docs.md      |  12 ++
 .../references/launch_usage.md                |  83 +++++++++++
 .../references/regenerate.md                  |  39 +++++
 .../cccl-devcontainer/references/tools.md     |  16 ++
 .agent/skills/cccl-devcontainers/SKILL.md     |  58 --------
 .agent/skills/cccl-docs/SKILL.md              |  94 ++++++++++++
 .../references/doxygen-breathe-gotchas.md     |  36 +++++
 .agent/skills/cccl-infra/SKILL.md             |  65 ++++++++
 .../cccl-infra/references/compiler-bump.md    |  64 ++++++++
 .../skills/cccl-infra/references/ctk-bump.md  |  65 ++++++++
 .agent/skills/cccl-infra/references/docs.md   |  21 +++
 .../cccl-infra/references/project-add.md      |  89 +++++++++++
 .../cccl-infra/references/release-cut.md      |  61 ++++++++
 .agent/skills/cccl-infra/references/tools.md  |  28 ++++
 .agent/skills/cccl-libcudacxx-style/SKILL.md  |  94 ------------
 .agent/skills/cccl-libcudacxx/SKILL.md        |  74 ++++++++++
 .../skills/cccl-libcudacxx/references/docs.md |  26 ++++
 .../references/style/headers.md               |  84 +++++++++++
 .../references/style/macros.md                |  69 +++++++++
 .../references/style/naming.md                |  53 +++++++
 .../references/style/templates.md             |  46 ++++++
 .../references/style/testing.md               |  65 ++++++++
 .../references/style/visibility.md            |  53 +++++++
 .../cccl-libcudacxx/references/tools.md       |   9 ++
 .agent/skills/cccl-pr/SKILL.md                |  86 ++++++-----
 .agent/skills/cccl-precommit/SKILL.md         | 102 +++++++++++++
 .agent/skills/cccl-python/SKILL.md            |  34 ++---
 .agent/skills/cccl-python/references/docs.md  |  20 +++
 .agent/skills/cccl-python/references/tools.md |  18 +++
 .agent/skills/cccl-resplit-branch/SKILL.md    |  97 +++++-------
 .agent/skills/cccl-sass-diff/SKILL.md         |   3 +-
 .agent/skills/cccl-test/SKILL.md              |  89 +++++++++++
 .agent/skills/cccl-test/references/docs.md    |  19 +++
 .agent/skills/cccl-test/references/tools.md   |  33 +++++
 .agent/skills/cccl-thrust/SKILL.md            | 103 +++++++++++++
 .agent/skills/cccl-thrust/references/docs.md  |  20 +++
 .../references/execution-policies.md          |  71 +++++++++
 .agent/skills/cccl-thrust/references/tools.md |   9 ++
 .agent/skills/cccl-triage-nightly/SKILL.md    |  42 ------
 .agent/skills/cccl-triage-pr/SKILL.md         |  85 -----------
 .agent/skills/cccl-triage/SKILL.md            | 107 ++++++++++++++
 .../skills/cccl-triage/references/common.md   |  56 +++++++
 .../skills/cccl-triage/references/nightly.md  |  47 ++++++
 .agent/skills/cccl-triage/references/pr.md    |  35 +++++
 .agent/skills/cccl/SKILL.md                   |  64 ++++----
 .agent/skills/cccl/references/docs.md         |  27 ++++
 .../cccl/references/skills-and-agents.md      |  92 ++++++++++++
 .agent/skills/cccl_detail-ci/SKILL.md         | 139 ++++++++++++++++++
 .../cccl_detail-ci/references/copy-pr-bot.md  |  83 +++++++++++
 .../skills/cccl_detail-ci/references/docs.md  |  17 +++
 .../references/inspect-changes.md             | 104 +++++++++++++
 .../references/inspect_changes_usage.md       |  75 ++++++++++
 .../references/matrix-expansion.md            |  82 +++++++++++
 .../skills/cccl_detail-ci/references/tools.md |  15 ++
 .agent/skills/cccl_detail-cmake/SKILL.md      | 110 ++++++++++++++
 .../references/arch-flags.md                  |  42 ++++++
 .../references/custom-commands.md             | 109 ++++++++++++++
 .../references/downstream-consumers.md        |  64 ++++++++
 .agent/skills/cccl_detail-cpp-macros/SKILL.md | 137 +++++++++++++++++
 .../references/compiler-detection.md          |  97 ++++++++++++
 .../references/diagnostics.md                 |  94 ++++++++++++
 .../references/visibility-abi.md              |  74 ++++++++++
 .../cccl_detail-devcontainer-matrix/SKILL.md  |  75 ++++++++++
 .../references/tools.md                       |  13 ++
 .agent/skills/cccl_detail-examples/SKILL.md   |  93 ++++++++++++
 .../cccl_detail-examples/references/docs.md   |  21 +++
 .agent/skills/cccl_detail-github/SKILL.md     | 108 ++++++++++++++
 .../cccl_detail-github/references/docs.md     |  24 +++
 .agent/skills/cccl_detail-release/SKILL.md    |  88 +++++++++++
 .../cccl_detail-release/references/docs.md    |  18 +++
 .../skills/cccl_detail-test-params/SKILL.md   |  84 +++++++++++
 AGENTS.md                                     |  39 ++---
 119 files changed, 6034 insertions(+), 1050 deletions(-)
 create mode 100644 .agent/agents/cccl-ci-fetch-failures.md
 create mode 100644 .agent/agents/cccl-ci-summarize-job-log.md
 delete mode 100644 .agent/agents/cccl-fetch-ci-failures.md
 delete mode 100644 .agent/agents/cccl-ok-to-test.md
 delete mode 100644 .agent/agents/cccl-summarize-job-log.md
 delete mode 100644 .agent/skills/cccl-agent-impl/SKILL.md
 create mode 100644 .agent/skills/cccl-bench/SKILL.md
 create mode 100644 .agent/skills/cccl-bench/references/ci-bench-request.md
 create mode 100644 .agent/skills/cccl-bench/references/docs.md
 create mode 100644 .agent/skills/cccl-bench/references/local-run.md
 create mode 100644 .agent/skills/cccl-bench/references/nvbench-template.md
 create mode 100644 .agent/skills/cccl-bench/references/tools.md
 create mode 100644 .agent/skills/cccl-bench/references/tuning.md
 create mode 100644 .agent/skills/cccl-bisect/references/docs.md
 create mode 100644 .agent/skills/cccl-bisect/references/git_bisect_usage.md
 create mode 100644 .agent/skills/cccl-bisect/references/tools.md
 delete mode 100644 .agent/skills/cccl-build-and-test-targets/SKILL.md
 create mode 100644 .agent/skills/cccl-build/SKILL.md
 create mode 100644 .agent/skills/cccl-build/references/arch-flag.md
 create mode 100644 .agent/skills/cccl-build/references/build_and_test_targets_usage.md
 create mode 100644 .agent/skills/cccl-build/references/build_common.sh_usage.md
 create mode 100644 .agent/skills/cccl-build/references/docs.md
 create mode 100644 .agent/skills/cccl-build/references/tools.md
 create mode 100644 .agent/skills/cccl-c/SKILL.md
 create mode 100644 .agent/skills/cccl-c/references/tools.md
 delete mode 100644 .agent/skills/cccl-ci-benchmarks/SKILL.md
 create mode 100644 .agent/skills/cccl-ci/references/docs.md
 create mode 100644 .agent/skills/cccl-ci/references/tools.md
 create mode 100644 .agent/skills/cccl-clarify/references/breakdown-flow.md
 create mode 100644 .agent/skills/cccl-cmake/SKILL.md
 create mode 100644 .agent/skills/cccl-commit/references/commit-message-rules.md
 create mode 100644 .agent/skills/cccl-commit/references/pre-commit-autofix.md
 create mode 100644 .agent/skills/cccl-commit/references/walkthrough-rules.md
 delete mode 100644 .agent/skills/cccl-cpp-builds/SKILL.md
 create mode 100644 .agent/skills/cccl-cub/SKILL.md
 create mode 100644 .agent/skills/cccl-cub/references/docs.md
 create mode 100644 .agent/skills/cccl-cub/references/tools.md
 create mode 100644 .agent/skills/cccl-cub/references/tuning-policies.md
 create mode 100644 .agent/skills/cccl-cudax/SKILL.md
 create mode 100644 .agent/skills/cccl-cudax/references/docs.md
 create mode 100644 .agent/skills/cccl-cudax/references/tools.md
 create mode 100644 .agent/skills/cccl-devcontainer/SKILL.md
 create mode 100644 .agent/skills/cccl-devcontainer/references/docs.md
 create mode 100644 .agent/skills/cccl-devcontainer/references/launch_usage.md
 create mode 100644 .agent/skills/cccl-devcontainer/references/regenerate.md
 create mode 100644 .agent/skills/cccl-devcontainer/references/tools.md
 delete mode 100644 .agent/skills/cccl-devcontainers/SKILL.md
 create mode 100644 .agent/skills/cccl-docs/SKILL.md
 create mode 100644 .agent/skills/cccl-docs/references/doxygen-breathe-gotchas.md
 create mode 100644 .agent/skills/cccl-infra/SKILL.md
 create mode 100644 .agent/skills/cccl-infra/references/compiler-bump.md
 create mode 100644 .agent/skills/cccl-infra/references/ctk-bump.md
 create mode 100644 .agent/skills/cccl-infra/references/docs.md
 create mode 100644 .agent/skills/cccl-infra/references/project-add.md
 create mode 100644 .agent/skills/cccl-infra/references/release-cut.md
 create mode 100644 .agent/skills/cccl-infra/references/tools.md
 delete mode 100644 .agent/skills/cccl-libcudacxx-style/SKILL.md
 create mode 100644 .agent/skills/cccl-libcudacxx/SKILL.md
 create mode 100644 .agent/skills/cccl-libcudacxx/references/docs.md
 create mode 100644 .agent/skills/cccl-libcudacxx/references/style/headers.md
 create mode 100644 .agent/skills/cccl-libcudacxx/references/style/macros.md
 create mode 100644 .agent/skills/cccl-libcudacxx/references/style/naming.md
 create mode 100644 .agent/skills/cccl-libcudacxx/references/style/templates.md
 create mode 100644 .agent/skills/cccl-libcudacxx/references/style/testing.md
 create mode 100644 .agent/skills/cccl-libcudacxx/references/style/visibility.md
 create mode 100644 .agent/skills/cccl-libcudacxx/references/tools.md
 create mode 100644 .agent/skills/cccl-precommit/SKILL.md
 create mode 100644 .agent/skills/cccl-python/references/docs.md
 create mode 100644 .agent/skills/cccl-python/references/tools.md
 create mode 100644 .agent/skills/cccl-test/SKILL.md
 create mode 100644 .agent/skills/cccl-test/references/docs.md
 create mode 100644 .agent/skills/cccl-test/references/tools.md
 create mode 100644 .agent/skills/cccl-thrust/SKILL.md
 create mode 100644 .agent/skills/cccl-thrust/references/docs.md
 create mode 100644 .agent/skills/cccl-thrust/references/execution-policies.md
 create mode 100644 .agent/skills/cccl-thrust/references/tools.md
 delete mode 100644 .agent/skills/cccl-triage-nightly/SKILL.md
 delete mode 100644 .agent/skills/cccl-triage-pr/SKILL.md
 create mode 100644 .agent/skills/cccl-triage/SKILL.md
 create mode 100644 .agent/skills/cccl-triage/references/common.md
 create mode 100644 .agent/skills/cccl-triage/references/nightly.md
 create mode 100644 .agent/skills/cccl-triage/references/pr.md
 create mode 100644 .agent/skills/cccl/references/docs.md
 create mode 100644 .agent/skills/cccl/references/skills-and-agents.md
 create mode 100644 .agent/skills/cccl_detail-ci/SKILL.md
 create mode 100644 .agent/skills/cccl_detail-ci/references/copy-pr-bot.md
 create mode 100644 .agent/skills/cccl_detail-ci/references/docs.md
 create mode 100644 .agent/skills/cccl_detail-ci/references/inspect-changes.md
 create mode 100644 .agent/skills/cccl_detail-ci/references/inspect_changes_usage.md
 create mode 100644 .agent/skills/cccl_detail-ci/references/matrix-expansion.md
 create mode 100644 .agent/skills/cccl_detail-ci/references/tools.md
 create mode 100644 .agent/skills/cccl_detail-cmake/SKILL.md
 create mode 100644 .agent/skills/cccl_detail-cmake/references/arch-flags.md
 create mode 100644 .agent/skills/cccl_detail-cmake/references/custom-commands.md
 create mode 100644 .agent/skills/cccl_detail-cmake/references/downstream-consumers.md
 create mode 100644 .agent/skills/cccl_detail-cpp-macros/SKILL.md
 create mode 100644 .agent/skills/cccl_detail-cpp-macros/references/compiler-detection.md
 create mode 100644 .agent/skills/cccl_detail-cpp-macros/references/diagnostics.md
 create mode 100644 .agent/skills/cccl_detail-cpp-macros/references/visibility-abi.md
 create mode 100644 .agent/skills/cccl_detail-devcontainer-matrix/SKILL.md
 create mode 100644 .agent/skills/cccl_detail-devcontainer-matrix/references/tools.md
 create mode 100644 .agent/skills/cccl_detail-examples/SKILL.md
 create mode 100644 .agent/skills/cccl_detail-examples/references/docs.md
 create mode 100644 .agent/skills/cccl_detail-github/SKILL.md
 create mode 100644 .agent/skills/cccl_detail-github/references/docs.md
 create mode 100644 .agent/skills/cccl_detail-release/SKILL.md
 create mode 100644 .agent/skills/cccl_detail-release/references/docs.md
 create mode 100644 .agent/skills/cccl_detail-test-params/SKILL.md

diff --git a/.agent/agents/cccl-ci-fetch-failures.md b/.agent/agents/cccl-ci-fetch-failures.md
new file mode 100644
index 00000000000..561a8c05ab3
--- /dev/null
+++ b/.agent/agents/cccl-ci-fetch-failures.md
@@ -0,0 +1,92 @@
+---
+name: cccl-ci-fetch-failures
+description: "Fetch failed jobs from a CCCL CI run — given a PR# or run ID, returns TSV of `<job-id>\\t<full-name>\\t<grouping-hint>` at a caller-specified path. Handles `gh api --paginate` slurp gotcha. Non-interactive, read-only. Called by `cccl-triage`."
+model: haiku
+color: cyan
+tools: Bash, Read
+---
+
+You are a non-interactive read-only `cccl-ci-fetch-failures` agent. The caller has a PR number or workflow run ID and wants a TSV of failed jobs for downstream summarization or override-matrix generation. You never modify files beyond writing the named output TSV and a raw-API scratch file, never call `AskUserQuestion`, never spawn subagents.
+
+---
+
+## FOR THE CALLING AGENT — What you must provide
+
+1. **One of `pr: <PR#>` or `run: <RUN_ID>`** — selects the workflow run.
+2. **`output: <path>`** — TSV destination.
+3. **`scratch: <dir>`** — for raw API responses (nests under caller's sessionid: `/tmp/claude/<caller-sid>/<subtask>/`).
+4. **Working directory** — absolute path; `pwd` to confirm.
+
+Missing any → return `under-briefed: <what's missing>`.
+
+## Workflow
+
+### 1. Resolve run ID
+
+If `pr:` given:
+- `gh pr view <PR#> --repo NVIDIA/cccl --json headRefName,headRefOid` → `BRANCH`, `HEAD_SHA`.
+- `gh run list --repo NVIDIA/cccl --branch <BRANCH> --limit 5 --json databaseId,headSha,conclusion` → pick the latest entry where `headSha == HEAD_SHA`. No match → `STATUS: UNDER_BRIEFED, reason: no_run_for_head`.
+- `RUN_ID = databaseId`.
+
+Avoid `gh pr view --json statusCheckRollup` — returns 100k+ tokens on CCCL PRs.
+
+### 2. Fetch jobs
+
+```
+gh api "repos/NVIDIA/cccl/actions/runs/<RUN_ID>/jobs?per_page=100" --paginate > <scratch>/jobs_raw.json
+```
+
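+For this endpoint, `--paginate` emits one JSON object per page, back to back, so the file is a stream of objects rather than a single document; subsequent `jq` calls need `-s` (slurp) to read them as one array.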
+
+### 3. Extract failures
+
+```
+jq -s -r '.[].jobs[] | select(.conclusion == "failure") | [.id, .name] | @tsv' \
+   <scratch>/jobs_raw.json > <scratch>/failed_jobs_raw.tsv
+```
+
+Empty → `STATUS: NO_FAILURES`. Write an empty file at `<output>`.
+
+### 4. Append grouping hints
+
+For each row, parse the job name and append a third tab-separated column, `<toolchain>|<project>|<variant>`:
+- Toolchain: the `[CTK<X> <COMPILER><VER> C++<STD>]` substring, with the surrounding brackets stripped.
+- Project: CUB / libcudacxx / Thrust / cudax / Python.
+- Variant: Build / Test / HostLaunch / DeviceLaunch / TestNoLaunch / etc.
+
+Example row:
+
+```
+74849038365	[CTK13.2 GCC15 C++20] cudax TestNoLaunch(amd64)	CTK13.2 GCC15 C++20|cudax|TestNoLaunch
+```
+
+Write to `<output>`.
+
+## Output
+
+```
+STATUS: OK | NO_FAILURES | UNDER_BRIEFED
+
+run_id: <RUN_ID>
+total_failures: <N>
+
+tally:
+  <toolchain>|<project>|<variant>: <count>
+  ...
+
+output_path: <output>
+```
+
+## Stop conditions
+
+- Missing `pr:` and `run:` → `STATUS: UNDER_BRIEFED`.
+- No failed jobs → `STATUS: NO_FAILURES`.
+- `gh api` non-zero exit → return raw stderr, `STATUS: UNDER_BRIEFED`.
+
+## Hard prohibitions
+
+- No `AskUserQuestion`. Not available; not applicable.
+- No spawning subagents. You are a leaf.
+- No file mutations beyond the named output paths.
+
+Universal bash rules are auto-injected — never restate.
diff --git a/.agent/agents/cccl-ci-overrides.md b/.agent/agents/cccl-ci-overrides.md
index 7ab63386ee1..43c91c766fa 100644
--- a/.agent/agents/cccl-ci-overrides.md
+++ b/.agent/agents/cccl-ci-overrides.md
@@ -1,119 +1,75 @@
 ---
 name: cccl-ci-overrides
-description: "Use this agent when a caller skill wants to limit CCCL CI cost on a PR via `workflows.override` matrix entries and/or `[skip-*]` commit tags. Typical triggers include cccl-triage-pr building a targeted-repro override after diagnosing failures, cccl-triage-nightly building one with `for_workflow: nightly`, and commit-prep flows asking \"what override + skip tags fit this diff?\". Takes working changes (paths or diff range) and/or a list of failed-job names; returns override snippet + skip tags + per-decision rationale. Knows `ci/project_files_and_dependencies.yaml`, `ci/matrix.yaml`, and `ci-overview.md`. Non-interactive. See \"When to invoke\" in the agent body for worked scenarios."
+description: "CCCL CI cost limiter — generates `workflows.override` matrix entries and `[skip-*]` tags from failed-job names and/or changed paths. Honors `ci/inspect_changes.py` and `ci-overview.md`. Non-interactive, read-only. Called by `cccl-triage`, `cccl-commit`."
 model: sonnet
 color: magenta
 tools: Bash, Read, Grep
 ---
 
-# cccl-ci-overrides
+You are a non-interactive read-only `cccl-ci-overrides` agent. The caller has paths or a diff range, and/or a list of failed-job names, and wants the minimum override matrix plus safe skip tags that target those jobs. You never modify files, never call `AskUserQuestion`, never spawn subagents.
 
-Advise on CI cost-limiting measures — override matrix entries and skip tags.
+---
+
+## FOR THE CALLING AGENT — What you must provide
 
-## When to invoke
+1. **One of `paths:` (newline-separated changed paths) or `diff_range: <BASE>..<HEAD>`** — drives skip-tag and dirty-project analysis.
+2. **`failed_jobs:`** (path to a file with failed-job names, one per line) — drives direct-reproduction override entries.
+3. **`for_workflow:`** — `pull_request` (default) | `pull_request_lite` | `nightly` | `weekly`.
+4. **Working directory** — absolute path; `pwd` to confirm.
 
-- **Targeted repro from failed jobs.** Triage skill diagnosed failures and wants the minimum override matrix that
-  reproduces them on a subsequent CI run.
-- **Diff-driven override.** Commit-prep flow has a set of changed paths (or a diff range) and wants to know which
-  matrix entries are needed and which `[skip-*]` tags are safe.
-- **Combined input.** Both failed-job list and changed paths; the agent unions and de-dupes the entries.
+At least one of `paths` / `diff_range` / `failed_jobs` required. Missing all three → return `under-briefed: no inputs`.
 
 ## Sources of truth
 
-- `ci/project_files_and_dependencies.yaml` — project definitions, `include_regexes`, `exclude_regexes`,
-  `exclude_project_files`, `lite_dependencies`, `full_dependencies`, global `ignore_regexes`. `core` is special:
-  any unmatched non-ignored file marks `core` dirty → full rebuild.
-- `ci/matrix.yaml` — `workflows.override` schema (see top-of-file examples). Workflow sections: `pull_request`,
-  `pull_request_lite`, `nightly`, `weekly`, `python-wheels`, `devcontainers`. Plus `exclude:` rules, `jobs:`
-  catalogue (job-key → `name:`), `projects:` catalogue, `tags:` defaults (notably
-  `project: { default: ['libcudacxx', 'cub', 'thrust'] }`).
+- `ci/project_files_and_dependencies.yaml` — project definitions, `include_regexes`, `exclude_regexes`, `exclude_project_files`, `lite_dependencies`, `full_dependencies`, global `ignore_regexes`. `core` is special — any unmatched non-ignored file marks `core` dirty → full rebuild.
+- `ci/matrix.yaml` — `workflows.override` schema (top-of-file examples). Workflow sections: `pull_request`, `pull_request_lite`, `nightly`, `weekly`, `python-wheels`, `devcontainers`. Plus `exclude:` rules, `jobs:` catalogue (job-key → `name:`), `projects:` catalogue, `tags:` defaults.
 - `ci-overview.md` — canonical `[skip-*]` tokens.
 
-## Tool to lean on
-
-`ci/inspect_changes.py --refs <BASE> <HEAD>` (or `--file`, `--stdin`) already implements the dep-graph trace and
-honors `ignore_regexes` + `exclude_*` rules. Prefer it over re-implementing.
-
-## Inputs
-
-Any combination of:
-
-- `paths:` (newline-separated changed paths) OR `diff_range: <BASE>..<HEAD>` — drives override + skip-tag
-  analysis.
-- `failed_jobs:` (path to file with failed-job names, one per line) — drives direct-reproduction override.
-- `for_workflow:` — `pull_request` (default) | `pull_request_lite` | `nightly` | `weekly`.
+## Workflow
 
-At least one of `paths`/`diff_range`/`failed_jobs` required.
+### 1. From changes
 
-## Override matrix — from changes
+`ci/inspect_changes.py --refs <BASE> <HEAD>` (or `--file`, `--stdin`) implements the dep-graph trace and honors `ignore_regexes` and the `exclude_*` rules. Run it to classify dirty projects rather than reimplementing the trace.
 
-1. Run `ci/inspect_changes.py` to classify dirty projects.
-2. From `for_workflow`'s section, pull entries that name a dirty project (or omit `project:` and the default set
-   intersects dirty).
-3. Subtract `exclude:` matches.
-4. Emit as override entries.
+Take each entry in `for_workflow`'s section that names a dirty project (or omits `project:` while the default set intersects the dirty set), drop any entry that matches an `exclude:` rule, and emit the rest as override entries.
 
-## Override matrix — from failed jobs
+### 2. From failed jobs
 
-1. Parse each name: `[CTK<X> <COMPILER><VER> C++<STD>] <Project> <JobName>(<Arch>)`. Cross-reference `jobs:` in
-   matrix.yaml to map `<JobName>` (e.g. `BuildHostLaunch`, `TestNoLaunch`, `NVRTC`) → job key (e.g. `build_lid0`,
-   `test_nolid`, `nvrtc`).
-2. Build the minimum override entry per name — `{jobs: [<key>], project: <name>, std: <std>, ctk: <ctk>,
-   cxx: <cxx>, gpu: <gpu if test>}`.
-3. Merge entries sharing `(project, jobs)`; combine `std`/`ctk`/`cxx` into lists.
+Parse each name: `[CTK<X> <COMPILER><VER> C++<STD>] <Project> <JobName>(<Arch>)`. Cross-reference `jobs:` in `matrix.yaml` to map `<JobName>` (e.g. `BuildHostLaunch`, `TestNoLaunch`, `NVRTC`) → job key (e.g. `build_lid0`, `test_nolid`, `nvrtc`).
 
-## Combining inputs
+Build the minimum override entry per name: `{jobs: [<key>], project: <name>, std: <std>, ctk: <ctk>, cxx: <cxx>, gpu: <gpu if test>}`. Merge entries sharing `(project, jobs)`; combine `std`/`ctk`/`cxx` into lists.
 
-If caller provides both, union the entries. De-dupe.
+### 3. Combine and emit
 
-## Snippet format
+Union entries from both inputs, de-dupe. If `workflows.override:` is already non-empty in `ci/matrix.yaml`, emit as **additions** — caller decides whether to append or replace.
 
-```yaml
-# Targeted repro of <source>. Reset before merging.
-- {jobs: ['build'], project: 'libcudacxx', std: 'all', ctk: ['12.0', '12.X'], cxx: ['gcc8', 'gcc9', 'gcc10']}
-- {jobs: ['build'], project: 'cub',        std: 17,    ctk: ['12.0', '12.X'], cxx: ['gcc8']}
-```
+For targeted repro via `build_and_test_targets.sh`, prefer the `target` project pattern from `matrix.yaml`'s top-of-file example.
 
-`<source>` = nightly run ID / PR check context / `<diff_range>` / "manual triage".
-
-For targeted repro via `build_and_test_targets.sh`, prefer the `target` project pattern from matrix.yaml's
-top-of-file example:
-
-```yaml
-- { jobs: ['run_gpu'], project: 'target', ctk: ['13.X'], cxx: 'gcc', gpu: 'rtxa6000',
-    args: '--preset cub-cpp20 --build-targets "cub.cpp20.test.iterator" --ctest-targets "cub.cpp20.test.iterator"' }
-```
-
-If `workflows.override:` is already non-empty, emit as **additions** — caller decides whether to append or
-replace.
-
-## Skip tags (path-based)
+### 4. Skip tags
 
 For each `[skip-*]` token in `ci-overview.md`, suggest if no changed path matches the area it protects:
 
-| Tag              | Suggest when no changed path matches          |
-|------------------|-----------------------------------------------|
-| `[skip-docs]`    | `docs/`, `*.rst`                              |
-| `[skip-vdc]`     | `.devcontainer/`, `ci/`, `.github/workflows/` |
-| `[skip-tpt]`     | third-party canary triggers                   |
-| `[skip-rapids]`  | RAPIDS paths (subset of tpt)                  |
-| `[skip-matx]`    | MatX paths (subset of tpt)                    |
-| `[skip-pytorch]` | PyTorch paths (subset of tpt)                 |
-| `[skip-matrix]`  | no CCCL build/test code (rare — docs/CI-only) |
+| Tag              | Suggest when no changed path matches           |
+|------------------|------------------------------------------------|
+| `[skip-docs]`    | `docs/`, `*.rst`                               |
+| `[skip-vdc]`     | `.devcontainer/`, `ci/`, `.github/workflows/`  |
+| `[skip-tpt]`     | third-party canary triggers                    |
+| `[skip-rapids]`  | RAPIDS paths (subset of tpt)                   |
+| `[skip-matx]`    | MatX paths (subset of tpt)                     |
+| `[skip-pytorch]` | PyTorch paths (subset of tpt)                  |
+| `[skip-matrix]`  | no CCCL build/test code (rare — docs/CI-only)  |
 
-Changes purely within `workflows.override:` target CI scope, not CI infra — don't withhold `[skip-vdc]` for them.
-Paths matching `ignore_regexes` already don't trigger CI — exclude in both directions.
-
-Note that the skip tags only apply to the last commit in a branch; save them until the end if making multiple
-commits.
+Changes purely within `workflows.override:` target CI scope, not CI infra — don't withhold `[skip-vdc]` for them. Paths matching `ignore_regexes` already don't trigger CI — exclude in both directions. Skip tags apply only to the last commit — save them until the final commit in a series.
 
 ## Output
 
 ```
+STATUS: OK | EMPTY | UNDER_BRIEFED
+
 ## Override matrix snippet (insert under `workflows.override:`)
 
 ```yaml
-# <source>. Reset before merging.
+# Targeted repro of <source>. Reset before merging.
 <entries>
 ```
 
@@ -128,4 +84,18 @@ commits.
 - Inputs: <inspect_changes.py summary, failed-job count>
 ```
 
-Omit "Override matrix snippet" if no entries; omit "Skip tags" if no `paths`/`diff_range` given.
+`<source>` = nightly run ID / PR check context / `<diff_range>` / "manual triage". Omit "Override matrix snippet" if no entries; omit "Skip tags" if no `paths`/`diff_range` given.
+
+## Stop conditions
+
+- Missing all three of `paths`/`diff_range`/`failed_jobs` → `STATUS: UNDER_BRIEFED`.
+- `inspect_changes.py` fails → return raw stderr, `STATUS: UNDER_BRIEFED`.
+- All entries produced are empty (clean diff, no failed jobs) → `STATUS: EMPTY`.
+
+## Hard prohibitions
+
+- No `AskUserQuestion`. Not available; not applicable.
+- No spawning subagents. You are a leaf.
+- No file mutations. Read-only.
+
+Universal bash rules are auto-injected — never restate.
diff --git a/.agent/agents/cccl-ci-summarize-job-log.md b/.agent/agents/cccl-ci-summarize-job-log.md
new file mode 100644
index 00000000000..e43f2251f9f
--- /dev/null
+++ b/.agent/agents/cccl-ci-summarize-job-log.md
@@ -0,0 +1,103 @@
+---
+name: cccl-ci-summarize-job-log
+description: "Summarize one downloaded CCCL CI job log — returns first real error, failing step, the exact failing command-line with compiler/linker flags, 5–20 lines of raw error output around the failure, and code/infra/flaky/unknown classification. Input is a local log path. Non-interactive, read-only. Called by `cccl-triage`."
+model: haiku
+color: cyan
+tools: Bash, Read, Grep
+---
+
+You are a non-interactive read-only `cccl-ci-summarize-job-log` agent. The caller has one downloaded CCCL CI job log and wants a digest of the first real error, the failing step, the exact failing command-line with its compiler/linker flags, 5–20 lines of raw error output verbatim, infra-vs-code classification, and any CCCL-specific flag worth surfacing. You never modify files, never call `AskUserQuestion`, never spawn subagents.
+
+---
+
+## FOR THE CALLING AGENT — What you must provide
+
+1. **`log: <path>`** — absolute path to the downloaded job log (typically `/tmp/claude/<caller-sid>/job_<JID>.log`).
+2. **`context: <one-line hint>`** (optional) — job name + toolchain. Surfaces in output if given.
+3. **Working directory** — absolute path; `pwd` to confirm.
+
+Missing `log:` → return `under-briefed: missing log path`. Log does not exist → return `under-briefed: log not found`.
+
+## Workflow
+
+### 1. Find the first real error
+
+Grep for `error|FAIL|exit code|##\[error\]` (case-insensitive). Read context around each hit. Retries of the same error → pick the underlying cause, not the retry.
+
+### 2. Identify the failing step
+
+GHA logs prefix each step with a `##[group]` banner; the command appears immediately below (often with `+` from `set -x`).
+
+### 3. Capture the failing command
+
+The `+ <cmd>` line (or `##[group]Run …` block) immediately preceding the error is the exact invocation that
+failed — capture it verbatim, including every compiler / linker / CMake flag, architecture flag, `-std=`,
+`-D` define, include path, and the source file. Downstream triage relies on the full command-line, so do
+not truncate.
+
+### 4. Capture raw error output
+
+Reproduce **5–20 lines** of the log around the first real error, verbatim — no paraphrasing, no
+ellipses inside a line. Trim only outer noise (timestamps, group banners). Include:
+
+- Compiler/linker diagnostics with their `file:line:column:` prefixes.
+- The full error message and the template instantiation chain (`required from here`, `note:` chains).
+- For test failures: the assertion message, expected/actual values, and stack frames if present.
+- For infra failures: the relevant runner output (OOM trace, network timeout, container pull failure).
+
+Aim toward the upper end (15–20 lines) when the diagnostic includes template instantiation chains or
+multi-line assertion output; trim toward 5 lines only when the error is genuinely a single line.
+
+### 5. Classify
+
+- **`code`** — real failure: compile error, test assertion, link error, runtime crash from CCCL code.
+- **`infra`** — network, artifact upload/download, container pull, runner crash, OOM, disk full, timeout on the runner.
+- **`flaky`** — known-flaky test; the rest of the run otherwise succeeded.
+- **`unknown`** — cannot classify confidently.
+
+### 6. CCCL-specific flags
+
+Surface only if useful for downstream triage:
+- Specific toolchain combo (informs `cccl-ci-overrides` matrix).
+- Cluster of related failures (e.g. all `cudax TestNoLaunch` on one CTK).
+- Path naming a recently-introduced change.
+
+## Output
+
+Emit the following structure (the inner ``` fences are literal — keep them in your output):
+
+    STATUS: OK | UNDER_BRIEFED
+
+    **Job:** <context or log basename>
+    **Class:** code | infra | flaky | unknown
+
+    **Failing step:** <step name>
+
+    **Failing command** (log line <N>):
+    ```
+    <verbatim command-line, including all compiler / linker / CMake / -arch / -std / -D / -I flags>
+    ```
+
+    **Raw error output** (log lines <M>–<M+k>):
+    ```
+    <5–20 lines verbatim from the log around the first real error>
+    ```
+
+    **CCCL flags:**
+      - <observation>
+
+The verbatim **Failing command** and **Raw error output** blocks are the deliverable — keep them faithful to the log. Surrounding prose stays short.
+
+## Stop conditions
+
+- Missing `log:` → `STATUS: UNDER_BRIEFED`.
+- Log path does not exist → `STATUS: UNDER_BRIEFED`.
+- No errors detected in log → `STATUS: OK`, class = `unknown`, with note in CCCL flags.
+
+## Hard prohibitions
+
+- No `AskUserQuestion`. Not available; not applicable.
+- No spawning subagents. You are a leaf.
+- No file mutations.
+
+Universal bash rules are auto-injected — never restate.
diff --git a/.agent/agents/cccl-fetch-ci-failures.md b/.agent/agents/cccl-fetch-ci-failures.md
deleted file mode 100644
index 73fbf267f94..00000000000
--- a/.agent/agents/cccl-fetch-ci-failures.md
+++ /dev/null
@@ -1,53 +0,0 @@
----
-name: cccl-fetch-ci-failures
-description: "Use this agent when a caller skill needs the list of failed jobs from a CCCL CI run, given either a PR number or a workflow run ID. Typical triggers include cccl-triage-pr collecting failures for the current branch's PR, cccl-triage-nightly collecting failures for the latest scheduled nightly run, and any other skill that needs failed-job TSV output for downstream summarization or override-matrix generation. Output is a TSV at a caller-specified path with one row per failed job: `<job-id>\\t<full-name>\\t<grouping-hint>`. Handles `gh api --paginate` and the `jq -s` slurp gotcha. Non-interactive. See \"When to invoke\" in the agent body for worked scenarios."
-model: haiku
-color: cyan
-tools: Bash, Read
----
-
-# cccl-fetch-ci-failures
-
-Return failed jobs from a CCCL CI run as TSV.
-
-## When to invoke
-
-- **Triage-PR fetch.** A PR-triage skill has the PR number and needs a TSV of failed jobs to pick representatives
-  for log-fetching. Caller hands over PR#, output path, scratch dir.
-- **Triage-nightly fetch.** A nightly-triage skill has the workflow run ID (resolved from
-  `gh run list --workflow=ci-workflow-nightly.yml`) and needs the same TSV. Caller hands over run ID, output path,
-  scratch dir.
-
-## Inputs
-
-One of:
-- `pr: <PR#>` — latest run on the PR.
-- `run: <RUN_ID>` — specific workflow run.
-
-Plus `output: <path>` and `scratch: <dir>`. Missing any → abort.
-
-## Steps
-
-1. **Resolve the run ID.** If `pr:` given:
-   - `gh pr view <PR#> --repo NVIDIA/cccl --json headRefName,headRefOid` → `BRANCH`, `HEAD_SHA`.
-   - `gh run list --repo NVIDIA/cccl --branch <BRANCH> --limit 5 --json databaseId,headSha,conclusion` → pick the
-     latest entry whose `headSha == HEAD_SHA`. No match → abort.
-   - `RUN_ID = databaseId` from that entry.
-
-   Avoid `gh pr view --json statusCheckRollup` — it returns 100k+ tokens on CCCL PRs.
-2. **Fetch jobs.** `gh api repos/NVIDIA/cccl/actions/runs/<RUN_ID>/jobs?per_page=100 --paginate` into
-   `<scratch>/jobs_raw.json`. `--paginate` concatenates objects; subsequent `jq` needs `-s`.
-3. **Extract failures.** `jq -s -r '[.[].jobs[] | select(.conclusion == "failure")] | .[] | [.id, .name] | @tsv'`
-   into `<scratch>/failed_jobs_raw.tsv`. Empty → return zero-failures.
-4. **Append grouping hints.** Per row, parse the name and append `<toolchain>|<project>|<variant>`:
-   - Toolchain: `[CTK<X> <COMPILER><VER> C++<STD>]` substring.
-   - Project: CUB / libcudacxx / Thrust / cudax / Python.
-   - Variant: Build / Test / HostLaunch / DeviceLaunch / TestNoLaunch / etc.
-
-   Example row:
-   ```
-   74849038365	[CTK13.2 GCC15 C++20] cudax TestNoLaunch(amd64)	CTK13.2 GCC15 C++20|cudax|TestNoLaunch
-   ```
-
-   Write to `<output>`.
-5. **Return summary** — count + tally of the third column.
diff --git a/.agent/agents/cccl-ok-to-test.md b/.agent/agents/cccl-ok-to-test.md
deleted file mode 100644
index be831cc13e0..00000000000
--- a/.agent/agents/cccl-ok-to-test.md
+++ /dev/null
@@ -1,53 +0,0 @@
----
-name: cccl-ok-to-test
-description: "Use this agent when a caller skill has pushed a commit to a CCCL PR's branch and wants to trigger CI by posting the copy-pr-bot `/ok to test <SHA>` comment. Typical triggers include cccl-triage-pr after a fix commit lands on an existing PR, cccl-triage-nightly after opening a new draft PR for a nightly fix, and any caller that needs the SHA-verification gate (local HEAD vs remote PR head) before posting. The agent verifies the local SHA matches the remote head, aborts on mismatch, posts the comment, and suggests the caller schedule a 20-minute polling loop. Non-interactive. Never pushes, never creates PRs, never force-pushes — the caller owns all of those decisions. See \"When to invoke\" in the agent body for worked scenarios."
-model: haiku
-color: yellow
-tools: Bash, Read
----
-
-# cccl-ok-to-test
-
-Verify local-vs-remote SHA for a CCCL PR; post `/ok to test <SHA>`.
-
-## When to invoke
-
-- **PR-triage CI restart.** Caller has just pushed a fix commit to the existing PR's branch. Agent verifies local
-  HEAD matches remote head, posts `/ok to test <SHA>`, returns the SHA + a polling reminder.
-- **Nightly-triage first CI run.** Caller just created a draft PR for a nightly fix and needs the initial
-  `/ok to test`. Same flow.
-- **Mismatch gate.** Caller (or user) suspects local and remote may have diverged. Agent's first job is to
-  refuse-and-report on mismatch.
-
-## Inputs
-
-1. `<PR#>`
-2. `<OWNER/REPO>` (typically `NVIDIA/cccl`, always explicit)
-3. `<BRANCH>`
-
-Missing → abort naming the field.
-
-## Steps
-
-1. `git rev-parse HEAD` → `LOCAL_SHA`. The only SHA used in the comment; never derived elsewhere.
-2. `gh pr view <PR#> --repo <OWNER/REPO> --json headRefOid,isDraft,headRefName` → `REMOTE_SHA`, `isDraft`,
-   `headRefName`.
-3. `headRefName != <BRANCH>` → abort showing both.
-4. `LOCAL_SHA != REMOTE_SHA` → abort:
-   ```
-   ERROR: local HEAD does not match remote PR head.
-     local:   <LOCAL_SHA>
-     remote:  <REMOTE_SHA>
-   Likely: unpushed commits, or someone else pushed after you.
-   Aborting without posting `/ok to test`.
-   ```
-5. `gh pr comment <PR#> --repo <OWNER/REPO> --body "/ok to test <LOCAL_SHA>"`.
-6. Return:
-   ```
-   Posted `/ok to test <LOCAL_SHA>` on PR #<PR#>. Draft: <isDraft>.
-   Caller: consider `ScheduleWakeup(delaySeconds=1200)` polling on
-   `gh pr checks <PR#>`.
-   ```
-
-Local SHA is the contract — the caller just pushed it. Remote SHA is checked only as a sync gate against
-concurrent pushes.
diff --git a/.agent/agents/cccl-summarize-job-log.md b/.agent/agents/cccl-summarize-job-log.md
deleted file mode 100644
index 77259d7632e..00000000000
--- a/.agent/agents/cccl-summarize-job-log.md
+++ /dev/null
@@ -1,59 +0,0 @@
----
-name: cccl-summarize-job-log
-description: "Use this agent when a caller skill has downloaded a single CCCL CI job log and needs a 5–10 line summary. Typical triggers include cccl-triage-pr or cccl-triage-nightly summarizing one representative log per failure cluster (dispatched in parallel — one agent per log), and any other workflow that wants to digest a job log without loading the full output into orchestrator context. Input is a path to a downloaded job log (typically `/tmp/claude/<sessionid>/job_<JID>.log`). Output covers first real error, failing command/step, stack trace, infra-vs-code classification, and anything CCCL-specific worth flagging. Non-interactive. See \"When to invoke\" in the agent body for worked scenarios."
-model: haiku
-color: cyan
-tools: Bash, Read, Grep
----
-
-# cccl-summarize-job-log
-
-Read one CCCL CI job log; return a tight summary.
-
-## When to invoke
-
-- **Cluster-representative summarization.** A triage skill picked one representative job per failure cluster,
-  fetched logs to `/tmp/claude/<sessionid>/job_<JID>.log`, and dispatches one summarize agent per log in parallel.
-  Each returns first-error, failing-step, infra-vs-code classification.
-- **One-off log digest.** A skill needs to know what's in a single job log (whose path it already has) without
-  reading the full text into orchestrator context.
-
-## Inputs
-
-- `log: <path>` — full path to a downloaded job log.
-- `context: <one-line hint>` (optional) — e.g. job name + toolchain.
-
-Missing `log:` → abort.
-
-## Steps
-
-1. **Find the first real error.** Grep for `error|FAIL|exit code|##[error]` (case-insensitive) and read context
-   around the hits. Ignore retries of the same error — pick the underlying cause.
-2. **Identify the failing step.** GHA logs prefix each step with a `##[group]` banner; the command appears just
-   below (often with `+` from `set -x`).
-3. **Capture the diagnostic.** File:line + 1–2 lines of context for compiler/linker/test failures; step name for
-   infra failures.
-4. **Classify.** `code` (real failure) / `infra` (network, artifact, container pull, runner crash, OOM, timeout) /
-   `flaky` (known-flaky test, rest of run succeeded) / `unknown`.
-5. **CCCL-specific flags.** Specific toolchain combo (useful for `cccl-ci-overrides`), cluster of related
-   failures, path naming a recently-introduced change.
-
-## Output
-
-```
-**Job:** <full name from `context:` or `<log-basename>`>
-**Class:** code | infra | flaky | unknown
-
-**First real error** (log line <N>):
-  <one or two lines>
-
-**Failing step:** <step name>
-
-**Diagnostic:**
-  <2-4 lines with file:line>
-
-**CCCL flags:**
-  - <observation>
-```
-
-≤10 lines of body text.
diff --git a/.agent/skills/cccl-agent-impl/SKILL.md b/.agent/skills/cccl-agent-impl/SKILL.md
deleted file mode 100644
index 67d9aeb56c3..00000000000
--- a/.agent/skills/cccl-agent-impl/SKILL.md
+++ /dev/null
@@ -1,54 +0,0 @@
----
-name: cccl-agent-impl
-description: "How skills and agents work in the CCCL repository. Filesystem layout, invocation, frontmatter, allow-list semantics, intent-driven auto-discovery. Load this skill when you land in the CCCL repo cold and don't know what skills or agents are, when you see references to `.agent/skills` or `.agent/agents` and want to understand them, or when authoring a new CCCL skill or agent."
----
-
-# cccl-agent-impl
-
-## Filesystem
-
-```
-<repo>/.agent/
-  skills/<name>/SKILL.md
-  agents/<name>.md
-
-<repo>/.claude/
-  skills  -> ../.agent/skills    (directory symlink)
-  agents  -> ../.agent/agents    (directory symlink)
-  settings.json
-```
-
-Canonical files live under `.agent/`. Claude Code reads `.claude/skills/` and `.claude/agents/`; Codex reads
-`.agent/`.
-
-## Skills
-
-`.agent/skills/<name>/SKILL.md`. Frontmatter:
-
-```yaml
----
-name: <kebab-case>
-description: "<trigger surface — used for intent matching>"
----
-```
-
-Invoke via the **Skill tool** with `skill: <name>`. Not reentrant.
-
-## Agents
-
-`.agent/agents/<name>.md`. Frontmatter:
-
-```yaml
----
-name: <name>
-description: "<what and when>"
-model: haiku
-tools: Read, Grep, Bash
----
-```
-
-CCCL agents are **non-interactive** — no `AskUserQuestion`. User dialogue belongs in the calling skill (often via
-`cccl-clarify`). Pick `model:` per workload: `haiku` for mechanical tasks (log parsing, jq munging, SHA
-verification); `sonnet` for multi-file reasoning or judgment (e.g. `cccl-ci-overrides`).
-
-Dispatch via the **Agent tool** with `subagent_type: <name>`. The agent runs to completion and returns one message.
diff --git a/.agent/skills/cccl-bench/SKILL.md b/.agent/skills/cccl-bench/SKILL.md
new file mode 100644
index 00000000000..e8b377a134a
--- /dev/null
+++ b/.agent/skills/cccl-bench/SKILL.md
@@ -0,0 +1,63 @@
+---
+description: "CCCL's benchmarking infrastructure — nvbench C++ benchmarks, Python `cuda.bench` bindings, `ci/bench.yaml` PR bench requests, and the `cccl.bench` tuning harness. Triggers: \"benchmark this PR\", \"write a benchmark\", \"request a bench run\", \"compare perf before/after\", \"tune kernel params\"."
+---
+
+# cccl-bench
+
+Orientation for CCCL's benchmark infrastructure: where bench sources live, how to write them, how to run them locally, how to request CI bench comparisons, and how tuning works.
+
+## Source layout
+
+| Location                              | Contents                                                                              |
+|---------------------------------------|----------------------------------------------------------------------------------------|
+| `cub/benchmarks/bench/<algo>/`         | CUB C++ benchmarks (`.cu` per variant, shared `.cuh` base)                            |
+| `python/cuda_cccl/benchmarks/`         | Python benchmarks using `cuda.bench`                                                  |
+| `benchmarks/cmake/CCCLBenchmarkRegistry.cmake` | CMake helpers: `add_bench`, `register_cccl_benchmark`, `register_cccl_tuning` |
+| `benchmarks/scripts/cccl/bench/`       | `cccl.bench` Python tuning harness                                                    |
+| `ci/bench/`                           | CI compare scripts (`bench.sh`, `compare_git_refs.sh`, `compare_paths.sh`)            |
+| `ci/bench.yaml`                       | PR bench-request config (edit to request; must match template to merge)               |
+| `ci/bench.template.yaml`              | Reset target — `ci/bench.yaml` must match this before merging                         |
+| `.github/workflows/bench.yml`          | Benchmark Compare workflow (triggered by `ci/bench.yaml` dispatch)                    |
+
+## Writing a C++ benchmark
+
+C++ benchmarks use [nvbench](https://github.com/NVIDIA/nvBench). The standard pattern: a shared `base.cuh` defines the benchmark function and `NVBENCH_BENCH_TYPES` registration; each `.cu` selects type axes and, optionally, tuning parameter ranges via `%RANGE%` annotations.
+
+See `references/nvbench-template.md` for a minimal template and axis patterns.
+
+## Writing a Python benchmark
+
+Python benchmarks mirror C++ targets and use `cuda.bench` (the Python nvbench binding). Each script registers a benchmark function, declares axes, and calls `bench.run_all_benchmarks(sys.argv)`. Filters in `ci/bench.yaml` match relative paths under `python/cuda_cccl/benchmarks/`.
+
+See `references/nvbench-template.md` for a Python example alongside the C++ one.
+
+## Running benchmarks locally
+
+CUB benchmarks require a Release build with `CMAKE_CUDA_ARCHITECTURES` set. Build target `cub.bench.<algo>.<variant>.base`, then run the binary directly with nvbench flags. The `ci/bench/bench.sh` wrapper handles two-ref comparisons from a single command.
+
+See `references/local-run.md` for build preset, binary invocation, and `bench.sh` usage.
+
+## Requesting a CI bench run
+
+Edit `ci/bench.yaml` to add regex filters and a GPU, add `[bench-only]` to commit messages, and push; CI then dispatches `.github/workflows/bench.yml`. Artifacts include per-target JSON, markdown summaries, and a `summary.md`.
+
+Full edit-and-tag flow: `references/ci-bench-request.md`.
+
+## Tuning
+
+The `cccl.bench` Python harness (`benchmarks/scripts/cccl/bench/`) drives kernel parameter search. Tunable benchmarks annotate parameters with `// %RANGE% DEFINE label start:end:step`. CMake generates a `.variant` target alongside the `.base` target when `CUB_ENABLE_TUNING=ON`. The harness builds both, sweeps the parameter space, and scores each variant against the base.
+
+See `references/tuning.md` for the full workflow.
+
+## Reset before merge
+
+`ci/bench.yaml` must match `ci/bench.template.yaml` exactly before merging. Otherwise the branch-protection diff check fails, so reset the file before the final merge.
+
+## Additional resources
+
+- `references/nvbench-template.md` — C++ and Python benchmark skeletons with axis patterns
+- `references/ci-bench-request.md` — full `ci/bench.yaml` edit flow, GPU pool, and `[bench-only]` tag
+- `references/local-run.md` — local build and run commands, `bench.sh` usage
+- `references/tuning.md` — `%RANGE%` annotation, `CUB_ENABLE_TUNING`, harness invocation
+- `references/docs.md` — index of benchmark documentation
+- `references/tools.md` — benchmark scripts with purpose and cross-references
diff --git a/.agent/skills/cccl-bench/references/ci-bench-request.md b/.agent/skills/cccl-bench/references/ci-bench-request.md
new file mode 100644
index 00000000000..45826a5a6b7
--- /dev/null
+++ b/.agent/skills/cccl-bench/references/ci-bench-request.md
@@ -0,0 +1,75 @@
+# CI bench request flow
+
+Bench comparisons run on dedicated GPU runners via `.github/workflows/bench.yml`, dispatched automatically from `ci/bench.yaml`.
+
+## Steps
+
+1. Edit `ci/bench.yaml`. Add regex filters under `benchmarks.filters.cub` and/or `benchmarks.filters.python`. Uncomment at least one GPU under `benchmarks.gpus`.
+
+   ```yaml
+   benchmarks:
+     filters:
+       cub:
+         - '^cub\.bench\.reduce\.(sum|min)\.'
+       python:
+         - 'compute/reduce/sum\.py'
+     gpus:
+       - "rtxa6000"   # sm_86, 48 GB
+   ```
+
+2. Add `[bench-only]` to commit messages while iterating. This suppresses all non-benchmark CI jobs.
+
+   ```
+   [bench-only] tune reduce block size
+   ```
+
+3. Push. CI dispatches a bench job per GPU listed. The job checks out both `base_ref` (default `origin/main`) and `test_ref` (default `HEAD`) via `ci/bench/compare_git_refs.sh`, builds CUB benchmarks in Release mode with sccache, runs targets matching the filters, and compares using `nvbench-compare`.
+
+4. Inspect artifacts: the workflow uploads `bench-artifacts/` with per-target JSON, markdown reports, and a `summary.md`. Job step summaries show collapsed comparison tables.
+
+5. Before final merge, reset `ci/bench.yaml` to match `ci/bench.template.yaml`. Both files must be identical or the branch-protection check fails.
+
+## GPU pool
+
+| Name          | SM      | VRAM    |
+|---------------|---------|---------|
+| `t4`          | sm_75   | 16 GB   |
+| `rtx2080`     | sm_75   | 8 GB    |
+| `rtxa6000`    | sm_86   | 48 GB   |
+| `l4`          | sm_89   | 24 GB   |
+| `rtx4090`     | sm_89   | 24 GB   |
+| `h100`        | sm_90   | 80 GB   |
+| `rtxpro6000`  | sm_120  | —       |
+
+GPU runners are shared. Be intentional — prefer one representative GPU unless architecture-specific behavior is under investigation.
+
+## Filter syntax
+
+CUB filters are regexes matched against ninja target names (`cub.bench.<algo>.<variant>.base`). Examples:
+
+```yaml
+- '^cub\.bench\.copy\.memcpy\.base$'     # exact target
+- '^cub\.bench\.reduce\.(sum|min)\.'      # all reduce sum/min variants
+```
+
+Python filters are regexes matched against relative paths under `python/cuda_cccl/benchmarks/`. Examples:
+
+```yaml
+- 'compute/reduce/sum\.py'
+- 'compute/transform/.*\.py'
+```
+
+## Advanced options
+
+```yaml
+benchmarks:
+  base_ref: "origin/main"       # default; any ref or SHA
+  test_ref: "HEAD"              # default; override to compare arbitrary refs
+  arch: "native"                # CMAKE_CUDA_ARCHITECTURES; "native" detects GPU
+  launch_args: "--cuda 13.2 --host gcc14"  # passed to .devcontainer/launch.sh
+  nvbench_args: >-
+    --timeout 30
+    --skip-time 15e-6
+    --stopping-criterion entropy
+  nvbench_compare_args: ""
+```
diff --git a/.agent/skills/cccl-bench/references/docs.md b/.agent/skills/cccl-bench/references/docs.md
new file mode 100644
index 00000000000..f2a80f48fc0
--- /dev/null
+++ b/.agent/skills/cccl-bench/references/docs.md
@@ -0,0 +1,13 @@
+# Documentation index — cccl-bench
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `ci/bench/README.md` | Benchmark infrastructure overview, `bench.sh` / `compare_git_refs.sh` usage, result comparison workflow. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cub/benchmarking.rst` | CUB benchmarking infrastructure and performance measurement workflows (Sphinx reference). |
diff --git a/.agent/skills/cccl-bench/references/local-run.md b/.agent/skills/cccl-bench/references/local-run.md
new file mode 100644
index 00000000000..074710886ea
--- /dev/null
+++ b/.agent/skills/cccl-bench/references/local-run.md
@@ -0,0 +1,73 @@
+# Local benchmark build and run
+
+## Prerequisites
+
+- Release build. CUB benchmarks fail CMake configuration in non-Release mode.
+- `CMAKE_CUDA_ARCHITECTURES` must be set.
+- A GPU must be available.
+
+## Build
+
+Use a CUB-enabled preset, e.g.:
+
+```bash
+cmake --preset cub-benchmarks   # or equivalent Release preset with CMAKE_CUDA_ARCHITECTURES
+cmake --build build --target cub.bench.reduce.sum.base
+```
+
+All bench targets roll up under `cub.all.benches`. To build every benchmark:
+
+```bash
+cmake --build build --target cub.all.benches
+```
+
+Target naming: `cub.bench.<algo>.<variant>.base` for the baseline; `cub.bench.<algo>.<variant>.variant` when `CUB_ENABLE_TUNING=ON`.
+
+## Run a single binary
+
+```bash
+./build/bin/cub.bench.reduce.sum.base \
+  --stopping-criterion entropy \
+  -d 0
+```
+
+Common nvbench flags:
+
+| Flag                          | Meaning                                                          |
+|-------------------------------|------------------------------------------------------------------|
+| `-d 0`                        | Device index (required; nvbench breaks with multiple visible GPUs) |
+| `--stopping-criterion entropy` | Adaptive stopping (recommended)                                  |
+| `--timeout <s>`               | Per-state timeout                                                 |
+| `--skip-time <s>`             | Skip states faster than this (noise floor)                        |
+| `-a "Elements{io}=[16,20,24]"` | Override a runtime axis                                           |
+| `--jsonbin result.json`       | Write results to JSON                                             |
+| `--jsonlist-benches`          | Print benchmark metadata                                          |
+
+## Compare two refs with bench.sh
+
+`ci/bench/bench.sh` wraps `compare_git_refs.sh`. It checks out each ref in a temporary worktree, builds, runs, and compares. Run from the repo root inside the devcontainer (GPU required):
+
+```bash
+./ci/bench/bench.sh "origin/main" "HEAD" \
+  --arch "native" \
+  --cub-filter "^cub\.bench\.copy\.memcpy\.base$"
+```
+
+With Python filters:
+
+```bash
+./ci/bench/bench.sh "origin/main" "HEAD" \
+  --python-filter "compute/reduce/sum\.py"
+```
+
+Artifacts land under `${CCCL_BENCH_ARTIFACT_ROOT:-./bench-artifacts}/`.
+
+## Compare two already-checked-out trees
+
+```bash
+./ci/bench/compare_paths.sh \
+  "/path/to/base/cccl" \
+  "/path/to/test/cccl" \
+  --arch "native" \
+  --cub-filter "^cub\.bench\.copy\.memcpy\.base$"
+```
diff --git a/.agent/skills/cccl-bench/references/nvbench-template.md b/.agent/skills/cccl-bench/references/nvbench-template.md
new file mode 100644
index 00000000000..972c68eeda5
--- /dev/null
+++ b/.agent/skills/cccl-bench/references/nvbench-template.md
@@ -0,0 +1,109 @@
+# nvbench benchmark templates
+
+## C++ (nvbench)
+
+Minimal structure: a shared `base.cuh` defines the benchmark function and registration macro; `.cu` files select type axes and optionally declare tuning ranges.
+
+`base.cuh`:
+```cpp
+#pragma once
+
+#include <cub/device/device_reduce.cuh>
+#include <nvbench_helper.cuh>
+
+template <typename T, typename OffsetT>
+void my_algo(nvbench::state& state, nvbench::type_list<T, OffsetT>)
+{
+  const auto elements = state.get_int64("Elements{io}");
+
+  thrust::device_vector<T> in = generate(elements);
+  thrust::device_vector<T> out(1);
+
+  state.add_element_count(elements);
+  state.add_global_memory_reads<T>(elements, "Size");
+  state.add_global_memory_writes<T>(1);
+
+  state.exec(nvbench::exec_tag::gpu | nvbench::exec_tag::no_batch,
+    [&](nvbench::launch& launch) {
+      // invoke kernel here
+    });
+}
+
+NVBENCH_BENCH_TYPES(my_algo, NVBENCH_TYPE_AXES(value_types, offset_types))
+  .set_name("base")
+  .set_type_axes_names({"T{ct}", "OffsetT{ct}"})
+  .add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4));
+```
+
+`sum.cu` (variant selecting types; adding tuning ranges):
+```cpp
+#include <nvbench_helper.cuh>
+
+// %RANGE% TUNE_ITEMS_PER_THREAD ipt 7:24:1
+// %RANGE% TUNE_THREADS_PER_BLOCK tpb 128:1024:32
+
+using value_types = all_types;
+using op_t        = ::cuda::std::plus<>;
+#include "base.cuh"
+```
+
+Axis suffix conventions:
+- `{ct}` — compile-time axis (type parameter)
+- `{io}` — runtime axis affecting I/O throughput display
+- No suffix — plain runtime axis
+
+Available type aliases (`nvbench_helper.cuh`): `all_types`, `value_types`, `offset_types`, `integral_types`, `float_types`.
+
+Tuning annotations (`%RANGE%`):
+```
+// %RANGE% DEFINE_NAME short_label start:end:step
+```
+CMake parses these to build `cub.<prefix>.<algo>.variant` alongside the `.base` target.
+
+## Python (`cuda.bench`)
+
+Python benchmarks mirror C++ targets. Filters in `ci/bench.yaml` match relative paths under `python/cuda_cccl/benchmarks/`.
+
+```python
+import sys
+from pathlib import Path
+sys.path.insert(0, str(Path(__file__).parent.parent))
+
+import cupy as cp
+import numpy as np
+from utils import SIGNED_TYPES as TYPE_MAP, as_cupy_stream, generate_data_with_entropy
+
+import cuda.bench as bench
+from cuda.compute import OpKind, make_reduce_into
+
+
+def bench_my_algo(state: bench.State):
+    type_str  = state.get_string("T{ct}")
+    dtype     = TYPE_MAP[type_str]
+    num_items = int(state.get_int64("Elements{io}"))
+
+    alloc_stream = as_cupy_stream(state.get_stream())
+    with alloc_stream:
+        d_in  = generate_data_with_entropy(num_items, dtype, "1.000", alloc_stream)
+        d_out = cp.empty(1, dtype=dtype)
+
+    state.add_element_count(num_items)
+    state.add_global_memory_reads(num_items * d_in.dtype.itemsize, "Size")
+    state.add_global_memory_writes(d_out.dtype.itemsize)
+
+    def launcher(launch: bench.Launch):
+        # invoke op here via launch.get_stream()
+        pass
+
+    state.exec(launcher, batched=False)
+
+
+if __name__ == "__main__":
+    b = bench.register(bench_my_algo)
+    b.set_name("base")
+    b.add_string_axis("T{ct}", list(TYPE_MAP.keys()))
+    b.add_int64_power_of_two_axis("Elements{io}", range(16, 29, 4))
+    bench.run_all_benchmarks(sys.argv)
+```
+
+Python benchmarks output nvbench-compatible JSON consumed by the same `nvbench-compare` tool used for C++.
diff --git a/.agent/skills/cccl-bench/references/tools.md b/.agent/skills/cccl-bench/references/tools.md
new file mode 100644
index 00000000000..7c83d107363
--- /dev/null
+++ b/.agent/skills/cccl-bench/references/tools.md
@@ -0,0 +1,17 @@
+# Tool index — cccl-bench
+
+## Owned (canonical reference lives here)
+
+| Tool | Purpose | Detail |
+|------|---------|--------|
+| `ci/bench/bench.sh` | Thin wrapper: compares two git refs by forwarding to `compare_git_refs.sh`. Usage: `bench.sh <base-ref> <test-ref> [compare_paths args...]`. | see `references/local-run.md` |
+| `ci/bench/compare_git_refs.sh` | Builds both refs and compares their benchmark output. Core benchmark comparison driver. | see `references/local-run.md` |
+| `ci/bench/compare_paths.sh` | Compares benchmarks between two already-checked-out source trees (skips the git checkout step). | see `references/local-run.md` |
+| `ci/bench/parse_bench_matrix.sh` | Parses `ci/bench.yaml` to extract benchmark job definitions for CI dispatch. | CI-internal; not user-invoked |
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `ci/util/build_and_test_targets.sh` | Used internally by bench scripts to build benchmark targets before running. | `cccl-build` → `references/build_and_test_targets_usage.md` |
+| `.devcontainer/launch.sh` | Wraps bench runs in the devcontainer. | `cccl-devcontainer` → `references/launch_usage.md` |
diff --git a/.agent/skills/cccl-bench/references/tuning.md b/.agent/skills/cccl-bench/references/tuning.md
new file mode 100644
index 00000000000..70eb3b0e27d
--- /dev/null
+++ b/.agent/skills/cccl-bench/references/tuning.md
@@ -0,0 +1,58 @@
+# CUB kernel parameter tuning
+
+## How it works
+
+Tunable benchmarks annotate kernel policy parameters with `%RANGE%` comments. CMake parses these at configure time to build a `.variant` target alongside the `.base` target. The `cccl.bench` Python harness (`benchmarks/scripts/cccl/bench/`) sweeps the parameter space, scores each variant against the base, and stores results in a SQLite database.
+
+## Annotating a benchmark for tuning
+
+In `base.cuh` or the `.cu` file, add `%RANGE%` comments above the tuning `#define`s:
+
+```cpp
+// %RANGE% TUNE_ITEMS_PER_THREAD ipt 7:24:1
+// %RANGE% TUNE_THREADS_PER_BLOCK tpb 128:1024:32
+// %RANGE% TUNE_ITEMS_PER_VEC_LOAD_POW2 ipv 1:2:1
+```
+
+Format: `// %RANGE% <DEFINE> <short_label> <start>:<end>:<step>`
+
+The benchmark must guard tuning-specific code with `#if !TUNE_BASE` / `#endif` so the `.base` target compiles without the tuning parameters.
+
+## Building with tuning enabled
+
+```bash
+cmake -DCUB_ENABLE_TUNING=ON ...
+cmake --build build --target cub.bench.reduce.sum.variant
+```
+
+When `CUB_ENABLE_TUNING=ON`, CMake generates `<build>/cub.bench.reduce.sum.variant.h`. The harness rewrites this header for each parameter combination and rebuilds.
+
+## Running the tuning harness
+
+The harness lives in `benchmarks/scripts/cccl/bench/` and is invoked via `benchmarks/scripts/run.py` (or `search.py` for search-driven tuning). Run from the build directory:
+
+```bash
+cd <build>
+python3 /path/to/benchmarks/scripts/run.py \
+  -R "^cub\.bench\.reduce\.sum$"
+```
+
+Key flags:
+
+| Flag                           | Meaning                                                |
+|--------------------------------|--------------------------------------------------------|
+| `-R <regex>`                   | Select benchmarks by name                              |
+| `-a "Axis=Value"`              | Pin a runtime axis (e.g. `-a "Elements{io}=1048576"`) |
+| `--num-shards N --run-shard K` | Parallel sharding                                      |
+| `-P0`                          | Run P0 (priority 0) subset                             |
+| `-l`                           | List available benchmarks and their variant counts     |
+
+The harness builds `.base` once, then iterates `.variant` targets. Results go into a SQLite database (`cccl_bench.db` by default) keyed by `(ctk, cccl, gpu, variant)`.
+
+## Scoring
+
+Each variant is scored as a weighted sum of speedups over the base across the runtime axis space. Weights are computed by `benchmarks/scripts/cccl/bench/score.py`. The highest-scoring variant for each `(ctk, cccl, gpu)` combination is the tuning winner.
+
+## Interpreting results
+
+`benchmarks/scripts/analyze.py` reads the SQLite database and produces summary tables. `benchmarks/scripts/compare.py` compares two runs (e.g. before and after a tuning change).
diff --git a/.agent/skills/cccl-bisect/SKILL.md b/.agent/skills/cccl-bisect/SKILL.md
index 4da48132993..2707bc095bb 100644
--- a/.agent/skills/cccl-bisect/SKILL.md
+++ b/.agent/skills/cccl-bisect/SKILL.md
@@ -1,29 +1,25 @@
 ---
-name: cccl-bisect
-description: "Run a git bisect on CCCL to identify which commit introduced a regression. Two routes: cloud (dispatch `.github/workflows/git-bisect.yml` via `gh workflow run`, runs in CCCL CI infrastructure on a GPU runner) or local (invoke `ci/util/git_bisect.sh` via `.devcontainer/launch.sh`). Walks the user through preset / build-targets / ctest-targets / lit-tests / good-ref / bad-ref selection. Use when the user has a regression and wants to find the introducing commit. Trigger phrases: \"bisect this regression\", \"find when X broke\", \"git bisect\"."
+description: "Run a git bisect on CCCL to find the commit that introduced a regression. Cloud route: dispatch `.github/workflows/git-bisect.yml` on a GPU runner. Local route: invoke `ci/util/git_bisect.sh` via `.devcontainer/launch.sh`. Walks through preset, build/test targets, good/bad refs. Triggers: \"bisect this regression\", \"find when X broke\", \"git bisect\"."
 ---
 
 # cccl-bisect
 
-Bisects are slow. Restrict build/test targets to the smallest set that reliably reproduces the regression.
+Bisects are slow. Restrict build and test targets to the smallest set that reliably reproduces the regression.
 
 ## Sources of truth
 
-- `.github/workflows/git-bisect.yml` — cloud-dispatch workflow.
-- `ci/util/git_bisect.sh` — local script wrapped by the workflow.
-- `ci/util/build_and_test_targets.sh` — per-commit configure/build/test driver.
-- `docs/cccl/development/build_and_bisect_tools.rst` — full docs.
+- `.github/workflows/git-bisect.yml`
+- `ci/util/git_bisect.sh`
+- `ci/util/build_and_test_targets.sh`
+- `docs/cccl/development/build_and_bisect_tools.rst`
 
 ## Inputs needed
 
-- **`preset`** — CMake preset (e.g. `cub-cpp20`, `thrust-cpp17`, `libcudacxx`, `cudax`). `cmake --list-presets`
-  enumerates them.
+- **`preset`** — CMake preset (e.g. `cub-cpp20`, `thrust-cpp17`, `libcudacxx`, `cudax`). `cmake --list-presets` enumerates them.
 - **`build_targets`** — space-separated ninja targets.
 - **`ctest_targets`** — space-separated CTest `-R` regexes. Optional.
-- **`lit_precompile_tests` / `lit_tests`** — space-separated libcudacxx lit paths relative to
-  `libcudacxx/test/libcudacxx/`. Optional.
-- **`good_ref`** / **`bad_ref`** — commit/tag/branch, or `-Nd` ("N days ago on main", e.g. `-7d`), or empty
-  (defaults: latest release tag / `main`).
+- **`lit_precompile_tests` / `lit_tests`** — space-separated libcudacxx lit paths relative to `libcudacxx/test/libcudacxx/`. Optional.
+- **`good_ref`** / **`bad_ref`** — commit/tag/branch, or `-Nd` ("N days ago on main", e.g. `-7d`), or empty (defaults: latest release tag / `main`).
 - **`cmake_options`** — extra `-D…=…` flags. Optional.
 - **`launch_args`** — extra `--cuda X` / `--host Y` for devcontainer. Optional.
 
@@ -64,10 +60,13 @@ Requires Docker.
     --ctest-targets '<regex>'
 ```
 
-Single long Bash invocation — no `&&` chains.
-
 ## Output
 
-Both routes write a `summary.md` capturing the found-bad commit (hash, author, message), the build/test command
-that distinguishes good from bad, and the bisect log. Cloud route surfaces a "Bisection Results" URL in the GHA
-step summary.
+Both routes write a `summary.md` capturing the found-bad commit (hash, author, message), the build/test command that distinguishes good from bad, and the bisect log. Cloud route surfaces a "Bisection Results" URL in the GHA step summary.
+
+## Additional resources
+
+- `references/docs.md` — index of CCCL bisect documentation.
+- `references/tools.md` — `git_bisect.sh` and cross-referenced build tools.
+- `references/git_bisect_usage.md` — `ci/util/git_bisect.sh` interface and examples.
+- `cccl-build` → `references/build_and_test_targets_usage.md` — build/test flags shared with bisect.
diff --git a/.agent/skills/cccl-bisect/references/docs.md b/.agent/skills/cccl-bisect/references/docs.md
new file mode 100644
index 00000000000..4737c7d65b7
--- /dev/null
+++ b/.agent/skills/cccl-bisect/references/docs.md
@@ -0,0 +1,11 @@
+# Documentation index — cccl-bisect
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cccl/development/build_and_bisect_tools.rst` | Build and bisect tool reference; `git_bisect.sh` options, good/bad ref syntax, summary output format. |
+
+## See also
+
+- `cccl-build` → `references/docs.md` for build tool documentation (bisect drives the same build/test commands).
diff --git a/.agent/skills/cccl-bisect/references/git_bisect_usage.md b/.agent/skills/cccl-bisect/references/git_bisect_usage.md
new file mode 100644
index 00000000000..a02f536b53f
--- /dev/null
+++ b/.agent/skills/cccl-bisect/references/git_bisect_usage.md
@@ -0,0 +1,111 @@
+# `ci/util/git_bisect.sh` usage
+
+Automates `git bisect` for CCCL regression isolation. Checks out each candidate commit, runs the
+specified build and test commands (same flag set as `build_and_test_targets.sh`), and produces a
+Markdown summary identifying the first bad commit, the distinguishing command, and the full bisect log.
+
+## Location
+
+`ci/util/git_bisect.sh`. Run from the repo root, inside the devcontainer. GPU required when
+`--ctest-targets` or `--lit-tests` is specified.
+
+## Interface
+
+```
+Usage: ./ci/util/git_bisect.sh [--preset NAME | --configure-override CMD] [options]
+
+Generic Options:
+
+  -h, --help             Show this help and exit
+
+Bisection Options:
+
+  --good-ref STR         Good ref/sha/tag/branch. Defaults to latest release tag.
+                         Accepts '-Nd' (e.g., '-14d') to mean 'origin/main as of N days ago'.
+  --bad-ref STR          Bad ref/sha/tag/branch. Defaults to origin/main.
+                         Accepts '-Nd' (e.g., '-14d') to mean 'origin/main as of N days ago'.
+  --summary-file PATH    Markdown summary output path (optional)
+                         No summary file will be generated if this is omitted.
+
+Build / Test Options:
+
+  --preset NAME             CMake preset
+  --cmake-options STR       Extra options passed to CMake preset configure (optional)
+  --configure-override CMD  Command to run for configuration instead of cmake preset
+                            If set, --preset and --cmake-options will be ignored
+  --build-targets STR       Space separated ninja build targets (optional)
+                            If omitted, no targets will be built -- explicitly specify 'all' if needed.
+  --ctest-targets STR       Space separated CTest -R regex patterns (optional)
+                            If omitted, no tests will be run -- explicitly specify '.' to run all.
+  --lit-precompile-tests STR  Space-separated libcudacxx lit test paths to precompile without execution (optional)
+                              e.g. 'cuda/utility/basic_any.pass.cpp'
+  --lit-tests STR            Space-separated libcudacxx lit test paths to execute (optional)
+                              e.g. 'cuda/utility/basic_any.pass.cpp'
+  --custom-test-cmd CMD     Custom command run after build and tests (optional)
+  --repeat N               Re-run the build/test for passing commits N times (default: 1)
+```
+
+## Options
+
+| Flag                    | Required? | Description                                                               |
+|-------------------------|-----------|---------------------------------------------------------------------------|
+| `--preset`              | Yes*      | CMake preset. Same as `build_and_test_targets.sh`.                        |
+| `--configure-override`  | Yes*      | Shell command replacing `cmake --preset`. Mutually exclusive with preset. |
+| `--good-ref`            | No        | Last-known-good commit/tag/branch. Defaults to latest release tag.        |
+| `--bad-ref`             | No        | First-known-bad commit/tag/branch. Defaults to `origin/main`.             |
+| `--summary-file`        | No        | Path for Markdown output. Omit to skip file generation.                   |
+| `--build-targets`       | No        | Space-separated ninja targets (quoted string).                            |
+| `--ctest-targets`       | No        | Space-separated CTest `-R` regex patterns (quoted string).                |
+| `--lit-precompile-tests`| No        | Lit paths to precompile; relative to `libcudacxx/test/libcudacxx/`.       |
+| `--lit-tests`           | No        | Lit paths to execute; relative to `libcudacxx/test/libcudacxx/`.          |
+| `--cmake-options`       | No        | Extra `-D…` flags for `cmake --preset`.                                   |
+| `--custom-test-cmd`     | No        | Arbitrary command run after build and tests.                              |
+| `--repeat`              | No        | Re-run passing commits N times to guard against flakes. Default: `1`.     |
+
+\* One of `--preset` or `--configure-override` is required.
+
+## Examples
+
+```bash
+# Bisect a CUB test failure against the last 14 days of main
+.devcontainer/launch.sh -d --cuda 13.2 --host gcc14 --gpus all -- \
+  ./ci/util/git_bisect.sh \
+    --summary-file /tmp/shared/bisect-summary.md \
+    --good-ref '-14d' \
+    --preset 'cub-cpp20' \
+    --build-targets 'cub.cpp20.test.iterator' \
+    --ctest-targets 'cub.cpp20.test.iterator'
+
+# Bisect with explicit good/bad refs (a tag and a branch here), no GPU (build-only)
+.devcontainer/launch.sh -d --cuda 13.2 --host gcc14 -- \
+  ./ci/util/git_bisect.sh \
+    --good-ref 'v2.7.0' \
+    --bad-ref 'main' \
+    --preset 'cub-cpp20' \
+    --build-targets 'cub.cpp20.test.iterator'
+
+# Cloud dispatch via GitHub Actions
+gh workflow run git-bisect.yml --repo NVIDIA/cccl --ref main \
+  -f runner='linux-amd64-gpu-rtxa6000-latest-1' \
+  -f preset='cub-cpp20' \
+  -f build_targets='cub.cpp20.test.iterator' \
+  -f ctest_targets='cub.cpp20.test.iterator' \
+  -f good_ref='-14d' \
+  -f bad_ref='main'
+```
+
+## Wraps / calls
+
+- `ci/util/build_and_test_targets.sh` — for each candidate commit's build+test step
+- `git bisect` — standard git bisect mechanics (start, good, bad, run, reset)
+
+## Notes / gotchas
+
+- Narrow `--build-targets` and `--ctest-targets` to the smallest set that reproduces the failure.
+  Each bisect step is a full configure+build+test; broad targets multiply bisect time significantly.
+- `--good-ref '-Nd'` resolves to the state of `origin/main` as of N days ago, not a local branch.
+- `--repeat N` reruns the test N times on commits that pass; useful when the failure is intermittent.
+- The summary file captures: bad commit hash/author/message, the distinguishing command, and the
+  full `git bisect log` output. Useful for surfacing in PR comments or CI step summaries.
+- Cloud route (`.github/workflows/git-bisect.yml`) produces a "Bisection Results" link in the GHA
+  step summary; prefer it for long bisects to avoid local GPU contention.
diff --git a/.agent/skills/cccl-bisect/references/tools.md b/.agent/skills/cccl-bisect/references/tools.md
new file mode 100644
index 00000000000..41857e751ec
--- /dev/null
+++ b/.agent/skills/cccl-bisect/references/tools.md
@@ -0,0 +1,14 @@
+# Tool index — cccl-bisect
+
+## Owned (canonical reference lives here)
+
+| Tool | Purpose | Detail |
+|------|---------|--------|
+| `ci/util/git_bisect.sh` | Automated git bisect: checks out commits, runs build+test via the same flags as `build_and_test_targets.sh`, and reports the first bad commit. | `references/git_bisect_usage.md` |
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `ci/util/build_and_test_targets.sh` | Build/test driver invoked internally by `git_bisect.sh` for each bisect step. | `cccl-build` → `references/build_and_test_targets_usage.md` |
+| `.devcontainer/launch.sh` | Wraps `git_bisect.sh` in the devcontainer for local bisect runs. | `cccl-devcontainer` → `references/tools.md` |
diff --git a/.agent/skills/cccl-build-and-test-targets/SKILL.md b/.agent/skills/cccl-build-and-test-targets/SKILL.md
deleted file mode 100644
index 13a32e58c90..00000000000
--- a/.agent/skills/cccl-build-and-test-targets/SKILL.md
+++ /dev/null
@@ -1,73 +0,0 @@
----
-name: cccl-build-and-test-targets
-description: "Reference for `ci/util/build_and_test_targets.sh` — CCCL's preset-driven configure/build/test driver used by CI, the bisect workflow, and ad-hoc local runs. Covers `--preset`, `--cmake-options`, `--configure-override`, `--build-targets`, `--ctest-targets`, `--lit-precompile-tests`, `--lit-tests`, `--custom-test-cmd`. Use when the user wants to build or test a specific target without running the full CI matrix. Trigger phrases: \"build just X\", \"run test Y\", \"targeted build\", \"how do I run the cub tests\"."
----
-
-# cccl-build-and-test-targets
-
-`ci/util/build_and_test_targets.sh` configures, builds, and tests a CMake preset with the targets you specify.
-Run it from the repo root, inside the devcontainer (or anywhere the preset's compiler is available).
-
-## Flags
-
-| Flag                               | Effect                                                                                          |
-|------------------------------------|-------------------------------------------------------------------------------------------------|
-| `--preset <name>`                  | CMake preset (or use `--configure-override` instead)                                            |
-| `--cmake-options "<flags>"`        | Extra `-D…=…` flags appended to preset configure                                                |
-| `--configure-override "<cmd>"`     | Custom configure command (overrides `--preset` and `--cmake-options`)                           |
-| `--build-targets "<targets>"`      | Space-separated ninja targets. Omit to skip build (`"all"` for everything)                      |
-| `--ctest-targets "<regex>"`        | Space-separated CTest `-R` regexes. Omit to skip tests (`"."` for all)                          |
-| `--lit-precompile-tests "<paths>"` | libcudacxx lit paths to compile without execution (relative to `libcudacxx/test/libcudacxx/`)   |
-| `--lit-tests "<paths>"`            | libcudacxx lit paths to compile AND execute                                                     |
-| `--custom-test-cmd "<cmd>"`        | Arbitrary command after tests                                                                   |
-
-`--build-targets` and `--ctest-targets` are opt-in. Omit → nothing builds or tests; the script just configures.
-
-## Common patterns
-
-Most cases: pick the preset and pass the target as both `--build-targets` and `--ctest-targets`:
-
-```
-ci/util/build_and_test_targets.sh \
-  --preset <preset> \
-  --build-targets "<target>" \
-  --ctest-targets "<target>"
-```
-
-| Project    | Preset(s)                        | Target example                |
-|------------|----------------------------------|-------------------------------|
-| CUB        | `cub-cpp17`, `cub-cpp20`         | `cub.cpp20.test.iterator`     |
-| Thrust     | `thrust-cpp17`, `thrust-cpp20`   | `thrust.cpp20.test.reduce`    |
-| cudax      | `cudax`                          | `cudax.cpp20.test.async_buffer` |
-| C Parallel | `cccl-c-parallel`                | `cccl.c.test.reduce`          |
-
-libcudacxx is lit-driven — use `--lit-precompile-tests` and `--lit-tests` instead of `--build-targets`:
-
-```
-ci/util/build_and_test_targets.sh \
-  --preset libcudacxx \
-  --lit-precompile-tests "std/algorithms/alg.nonmodifying/alg.any_of/any_of.pass.cpp" \
-  --lit-tests           "std/algorithms/alg.nonmodifying/alg.any_of/any_of.pass.cpp"
-```
-
-Avoid `--build-targets "libcudacxx.cpp20.precompile.lit"` — it precompiles the entire test suite.
-
-## Output
-
-Build dir at `build/${CCCL_BUILD_INFIX}/${PRESET}/` (parsed from the cmake configure log line
-`-- Build files have been written to:`). Phase-by-phase elapsed time printed with emoji status markers.
-
-## Wrapping in the devcontainer
-
-```
-.devcontainer/launch.sh -d --cuda 13.2 --host gcc14 -- \
-  ./ci/util/build_and_test_targets.sh \
-    --preset cub-cpp20 \
-    --build-targets "cub.cpp20.test.iterator"
-```
-
-## vs full-matrix scripts
-
-- `build_and_test_targets.sh` — single preset, named targets. Fast iteration.
-- `./ci/build_<project>.sh` / `./ci/test_<project>.sh` — full build/test cycles across host/std/arch matrix. Slow.
-  See `cccl-cpp-builds`.
diff --git a/.agent/skills/cccl-build/SKILL.md b/.agent/skills/cccl-build/SKILL.md
new file mode 100644
index 00000000000..d0c36d1caa6
--- /dev/null
+++ b/.agent/skills/cccl-build/SKILL.md
@@ -0,0 +1,85 @@
+---
+description: |
+  CCCL C++ build paths — fast iteration first, full matrix when needed.
+  Covers `ci/util/build_and_test_targets.sh` for single-preset targeted builds
+  and `ci/build_*.sh` for full host/std/arch matrix builds.
+  Triggers: "build just X", "targeted build", "build cub", "build thrust",
+  "full matrix build", "compile cudax".
+---
+
+# cccl-build
+
+Two build paths. Prefer the targeted `build_and_test_targets.sh` for inner-loop iteration;
+reach for the `ci/{build|test}_*` scripts when you need a complete host/std/arch sweep.
+
+## Fast iteration — `ci/util/build_and_test_targets.sh`
+
+Single wrapper around `cmake`, `ninja`, `ctest`, and `lit` for one preset at a time:
+
+- `cmake` — configure the preset (always runs unless cached)
+- `ninja` — build the named `--build-targets`
+- `ctest` — run the named `--ctest-targets` (regex list)
+- `lit` — run the named `--lit-tests` / `--lit-precompile-tests` (libcudacxx)
+
+This skill covers the configure/build flags. See `cccl-test` for the `ctest` and `lit` runners.
+
+Run from the repo root, inside the devcontainer (or anywhere the preset's compiler is available).
+
+```
+ci/util/build_and_test_targets.sh \
+  --preset <name> \
+  --build-targets "<target>"
+```
+
+Common preset/target pairs:
+
+| Project    | Preset(s)                      | Target example                  |
+|------------|--------------------------------|---------------------------------|
+| CUB        | `cub-cpp17`, `cub-cpp20`       | `cub.cpp20.test.iterator`       |
+| Thrust     | `thrust-cpp17`, `thrust-cpp20` | `thrust.cpp20.test.reduce`      |
+| cudax      | `cudax`                        | `cudax.cpp20.test.async_buffer` |
+| C Parallel | `cccl-c-parallel`              | `cccl.c.test.reduce`            |
+| libcudacxx | `libcudacxx`                   | use `--lit-precompile-tests`    |
+
+Other useful flags: `--cmake-options`, `--configure-override`. Omit `--build-targets` → configure only.
+Build dir: `build/${CCCL_BUILD_INFIX}/${PRESET}/`.
+
+Wrap in the devcontainer:
+
+```
+.devcontainer/launch.sh -d --cuda 13.2 --host gcc14 -- \
+  ./ci/util/build_and_test_targets.sh \
+    --preset cub-cpp20 \
+    --build-targets "cub.cpp20.test.iterator"
+```
+
+## Full matrix — `ci/build_*.sh`
+
+Per-project scripts that build across a full host/std/arch sweep. No GPU required for build.
+
+```
+./ci/build_<project>.sh [-cxx <compiler>] [-std <std>] [-arch "<arch-list>"]
+```
+
+| Project    | Script                  | Stds    |
+|------------|-------------------------|---------|
+| CUB        | `build_cub`             | 17, 20  |
+| Thrust     | `build_thrust`          | 17, 20  |
+| libcudacxx | `build_libcudacxx`      | 17, 20  |
+| cudax      | `build_cudax`           | 20 only |
+| C Parallel | `build_cccl_c_parallel` | 17 only |
+
+Architecture flag (`-arch`): semicolon-separated CMake `CUDA_ARCHITECTURES` list.
+`native` or `"80"` is much faster than `all-major-cccl`. See `references/arch-flag.md` for syntax forms.
+
+Full builds: 60+ min. Never cancel mid-run.
+`sccache` is enabled in the devcontainer (CCCL-team bucket auth).
+
+## Additional resources
+
+- `references/arch-flag.md` — architecture flag forms (`<XX>`, `<XX>-real`, `<XX>-virtual`, `native`, `all-major-cccl`)
+- `references/docs.md` — index of CCCL build documentation.
+- `references/tools.md` — all build scripts with purpose and ownership.
+- `references/build_and_test_targets_usage.md` — `ci/util/build_and_test_targets.sh` interface and examples.
+- `references/build_common.sh_usage.md` — `ci/build_common.sh` options, env vars, and helper functions.
+- See `cccl-test` for running tests after a build.
diff --git a/.agent/skills/cccl-build/references/arch-flag.md b/.agent/skills/cccl-build/references/arch-flag.md
new file mode 100644
index 00000000000..a17136a6e20
--- /dev/null
+++ b/.agent/skills/cccl-build/references/arch-flag.md
@@ -0,0 +1,22 @@
+# Architecture flag forms
+
+The `-arch` flag maps to CMake `CUDA_ARCHITECTURES`. Value is a semicolon-separated list.
+
+| Form             | Generates               | Notes                          |
+|------------------|-------------------------|--------------------------------|
+| `<XX>`           | PTX + SASS for SM XX    | e.g. `80`                      |
+| `<XX>-real`      | SASS only               | smaller binary, no JIT         |
+| `<XX>-virtual`   | PTX only                | JIT at runtime                 |
+| `native`         | Detect host GPU         | fastest for local iteration    |
+| `all-major-cccl` | Default for PR builds   | slowest; use only when needed  |
+| `all-cccl`       | A very frustrated user  | Just don't                     |
+
+Examples:
+
+```
+-arch "native"
+-arch "80"
+-arch "70;75;80-virtual"
+```
+
+Limiting `-arch` to `native` or a single SM is the single biggest build-time lever.
diff --git a/.agent/skills/cccl-build/references/build_and_test_targets_usage.md b/.agent/skills/cccl-build/references/build_and_test_targets_usage.md
new file mode 100644
index 00000000000..3e615bec4c6
--- /dev/null
+++ b/.agent/skills/cccl-build/references/build_and_test_targets_usage.md
@@ -0,0 +1,100 @@
+# `ci/util/build_and_test_targets.sh` usage
+
+Unified driver for configure, build, and test in one pass. The inner-loop tool for targeted builds
+and test runs against a single CMake preset. Wraps `cmake`, `ninja`, `ctest`, and `lit` in sequence;
+stops and reports on the first failure with an elapsed-time banner.
+
+## Location
+
+`ci/util/build_and_test_targets.sh`. Run from the repo root, inside the devcontainer (or any
+environment where the preset's compilers are on `PATH`). No GPU required for build-only invocations;
+GPU required for `--ctest-targets` and `--lit-tests`.
+
+## Interface
+
+```
+Usage: ./ci/util/build_and_test_targets.sh [--preset NAME | --configure-override CMD] [options]
+
+Options:
+  -h, --help                Show this help and exit
+  --preset NAME             CMake preset
+  --cmake-options STR       Extra options passed to CMake preset configure (optional)
+  --configure-override CMD  Command to run for configuration instead of cmake preset
+                            If set, --preset and --cmake-options will be ignored
+  --build-targets STR       Space separated ninja build targets (optional)
+                            If omitted, no targets will be built -- explicitly specify 'all' if needed.
+  --ctest-targets STR       Space separated CTest -R regex patterns (optional)
+                            If omitted, no tests will be run -- explicitly specify '.' to run all.
+  --lit-precompile-tests STR  Space-separated libcudacxx lit test paths to precompile without execution (optional)
+                              e.g. 'cuda/utility/basic_any.pass.cpp'
+  --lit-tests STR            Space-separated libcudacxx lit test paths to execute (optional)
+                              e.g. 'cuda/utility/basic_any.pass.cpp'
+  --custom-test-cmd CMD     Custom command run after build and tests (optional)
+```
+
+## Options
+
+| Flag                    | Required? | Description                                                              |
+|-------------------------|-----------|--------------------------------------------------------------------------|
+| `--preset`              | Yes*      | CMake preset name. Mutually exclusive with `--configure-override`.       |
+| `--configure-override`  | Yes*      | Shell command replacing the `cmake --preset` step. Ignores `--preset`.   |
+| `--cmake-options`       | No        | Extra `-D…` flags appended to `cmake --preset`. Space-separated string.  |
+| `--build-targets`       | No        | Ninja targets to build. Space-separated; quoted. Omit = configure only.  |
+| `--ctest-targets`       | No        | CTest `-R` regex patterns. Space-separated; each runs a separate `ctest`.|
+| `--lit-precompile-tests`| No        | Lit paths to precompile (no execution). Relative to `libcudacxx/test/libcudacxx/`. |
+| `--lit-tests`           | No        | Lit paths to execute. Relative to `libcudacxx/test/libcudacxx/`.         |
+| `--custom-test-cmd`     | No        | Arbitrary command run after all other test steps.                        |
+
+\* One of `--preset` or `--configure-override` is required.
+
+## Environment
+
+| Variable            | Default               | Effect                                                          |
+|---------------------|-----------------------|-----------------------------------------------------------------|
+| `CCCL_BUILD_INFIX`  | `""`                  | Subdirectory under `build/` for this devcontainer's artifacts.  |
+
+Build output lands in `build/${CCCL_BUILD_INFIX}/${PRESET}/`. The `build/latest` symlink always points
+to the most recent `build/${CCCL_BUILD_INFIX}/` directory; `build/preset-latest` to the most recent
+preset subdirectory within it.
+
+## Examples
+
+```bash
+# Build a single CUB target
+ci/util/build_and_test_targets.sh \
+  --preset cub-cpp20 \
+  --build-targets "cub.cpp20.test.iterator"
+
+# Build then run a CTest target
+ci/util/build_and_test_targets.sh \
+  --preset cub-cpp20 \
+  --build-targets "cub.cpp20.test.iterator" \
+  --ctest-targets "cub.cpp20.test.iterator"
+
+# libcudacxx lit: precompile then execute one test
+ci/util/build_and_test_targets.sh \
+  --preset libcudacxx \
+  --lit-precompile-tests "std/algorithms/alg.nonmodifying/alg.any_of/any_of.pass.cpp" \
+  --lit-tests           "std/algorithms/alg.nonmodifying/alg.any_of/any_of.pass.cpp"
+
+# Wrapped in the devcontainer
+.devcontainer/launch.sh -d --cuda 13.2 --host gcc14 -- \
+  ./ci/util/build_and_test_targets.sh \
+    --preset thrust-cpp20 \
+    --build-targets "thrust.cpp20.test.reduce"
+```
+
+## Wraps / calls
+
+- `cmake --preset` — configure step (or `--configure-override` replacement)
+- `ninja -C <build_dir>` — build step for each `--build-targets` entry
+- `ctest --test-dir <build_dir> -R <pattern>` — one invocation per `--ctest-targets` entry
+- `lit -v` — one invocation per `--lit-tests` entry; precompile pass uses `-Dexecutor=NoopExecutor()`
+
+## Notes / gotchas
+
+- `--ctest-targets` runs one `ctest -R <pattern>` per entry; patterns are regex, not glob.
+- `--lit-tests` paths are relative to `libcudacxx/test/libcudacxx/`. Absolute paths will fail.
+- Avoid `--build-targets "libcudacxx.cpp20.precompile.lit"` — it precompiles the entire lit suite.
+- `LIBCUDACXX_SITE_CONFIG` is set automatically from the build directory; do not override it.
+- The script exits on the first failure with a colored banner and elapsed time; subsequent steps are skipped.
diff --git a/.agent/skills/cccl-build/references/build_common.sh_usage.md b/.agent/skills/cccl-build/references/build_common.sh_usage.md
new file mode 100644
index 00000000000..9c8e8d3bfc4
--- /dev/null
+++ b/.agent/skills/cccl-build/references/build_common.sh_usage.md
@@ -0,0 +1,98 @@
+# `ci/build_common.sh` usage
+
+Shared build configuration library sourced by all `ci/build_*.sh` scripts. Not invoked directly.
+Parses the common option set, validates compilers, sets up environment variables, defines helper
+functions (`configure_preset`, `build_preset`, `test_preset`, `configure_and_build_preset`,
+`print_environment_details`, `run_ci_timed_command`), and establishes the build directory layout.
+
+## Location
+
+`ci/build_common.sh`. Must be **sourced** (`source ci/build_common.sh`), not executed. Each
+per-project `ci/build_*.sh` script sources this after extracting its own project-specific flags.
+
+## Interface
+
+```
+Usage: <script> [OPTIONS]
+
+The PARALLEL_LEVEL environment variable controls the amount of build parallelism.
+Default is the number of cores minus one.
+
+Options:
+  -v/-verbose:        enable shell echo for debugging
+  -configure:         Only run cmake to configure, do not build or test.
+  -cuda:              CUDA compiler (Defaults to $CUDACXX if set, otherwise nvcc)
+  -cxx:               Host compiler (Defaults to $CXX if set, otherwise g++)
+  -std:               CUDA/C++ standard (Defaults to 17)
+  -arch:              Target CUDA arches, e.g. "60-real;70;80-virtual" (Defaults to value in presets file)
+  -pedantic/--pedantic: Enable strict warnings-as-errors and expose CCCL header warnings (default in CI)
+  -cmake-options:     Additional options to pass to CMake
+
+Examples:
+  $ PARALLEL_LEVEL=8 ./ci/build_cub.sh
+  $ PARALLEL_LEVEL=8 ./ci/build_cub.sh -cxx g++-9
+  $ ./ci/build_cub.sh -cxx clang++-8
+  $ ./ci/build_cub.sh -configure -arch 80
+  $ ./ci/build_cub.sh -cxx g++-8 -std 14 -arch 80-real -v -cuda /usr/local/bin/nvcc
+  $ ./ci/build_cub.sh -cmake-options "-DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS=-Wfatal-errors"
+```
+
+## Options
+
+| Flag               | Default           | Description                                                             |
+|--------------------|-------------------|-------------------------------------------------------------------------|
+| `-cxx`             | `$CXX` or `g++`   | Host C++ compiler path or name.                                         |
+| `-cuda`            | `$CUDACXX` or `nvcc` | CUDA compiler path or name.                                          |
+| `-std`             | `17`              | C++ standard (14, 17, 20).                                              |
+| `-arch`            | Preset default    | Semicolon-separated CMake `CMAKE_CUDA_ARCHITECTURES` value.             |
+| `-configure`       | off               | Configure only; skip build and test steps.                              |
+| `-v` / `-verbose`  | off               | Enable `set -x` shell tracing for debugging.                            |
+| `-pedantic`        | on in CI          | Enables `-DCCCL_ENABLE_WERROR=ON -DCCCL_ENABLE_PRAGMA_SYSTEM_HEADER=OFF`. |
+| `-disable-benchmarks` | off           | Force-disable CUB benchmark targets (sets `DISABLE_CUB_BENCHMARKS=1`). |
+| `-cmake-options`   | none              | Extra CMake flags appended to the configure command.                    |
+
+## Environment
+
+| Variable                  | Default                      | Effect                                                           |
+|---------------------------|------------------------------|------------------------------------------------------------------|
+| `PARALLEL_LEVEL`          | `nproc --all --ignore=1`     | Ninja and CTest parallelism.                                     |
+| `CXX`                     | `g++`                        | Overrides default host compiler (superseded by `-cxx` flag).     |
+| `CUDACXX`                 | `nvcc`                       | Overrides default CUDA compiler (superseded by `-cuda` flag).    |
+| `VERBOSE`                 | off                          | Same effect as `-v` when set to non-empty.                       |
+| `PEDANTIC`                | auto (on in CI)              | Enable strict warnings. Set to `1` to force on locally.          |
+| `CCCL_BUILD_INFIX`        | `""`                         | Subdirectory under `../build/` for per-devcontainer isolation.   |
+| `DISABLE_CUB_BENCHMARKS`  | off                          | Skip CUB benchmark targets when set to `1`.                      |
+| `CCCL_CI_COMMAND_TIMEOUT` | `5.5h`                       | Per-step timeout in GHA; prevents orphaned jobs.                 |
+| `MEMMON`                  | off                          | Enable memory monitor logging outside of GHA.                    |
+| `MEMMON_POLL_INTERVAL`    | `5` (sec)                    | Sampling interval for `ci/util/memmon.sh`.                       |
+| `MEMMON_LOG_THRESHOLD`    | `2` (GB)                     | Log entry threshold for memory monitor.                          |
+| `MEMMON_PRINT_THRESHOLD`  | `5` (GB)                     | Print-to-console threshold for memory monitor.                   |
+
+## Build directory layout
+
+```
+../build/
+  ${CCCL_BUILD_INFIX}/       ← devcontainer-specific root
+    ${PRESET}/               ← per-preset build artifacts
+  latest -> ${CCCL_BUILD_INFIX}/
+  preset-latest -> ${CCCL_BUILD_INFIX}/${PRESET}/
+```
+
+## Key functions
+
+| Function                    | Called by            | What it does                                          |
+|-----------------------------|----------------------|-------------------------------------------------------|
+| `configure_preset`          | build scripts        | Runs `cmake --preset` with retry on CI.               |
+| `build_preset`              | build scripts        | Runs `cmake --build --preset`; starts/stops memmon.   |
+| `test_preset`               | build scripts        | Runs `ctest --preset`; prints time summary.           |
+| `configure_and_build_preset`| build scripts        | Combines `configure_preset` + `build_preset`.         |
+| `print_environment_details` | build scripts        | Logs compilers, versions, GPU info, sccache state.    |
+| `run_ci_timed_command`      | build/test functions | Wraps commands with `timeout` in GHA.                 |
+
+## Notes / gotchas
+
+- `PEDANTIC` is automatically enabled inside GitHub Actions even if not passed on the command line.
+- The `-arch` flag corresponds to CMake `CMAKE_CUDA_ARCHITECTURES`; use semicolons as separators (not commas).
+  See `cccl-build` → `references/arch-flag.md` for all valid forms.
+- `sccache` is used automatically when present on `PATH` (standard in the devcontainer).
+- `CCCL_CI_COMMAND_TIMEOUT` only applies inside GitHub Actions. Local runs have no timeout.
diff --git a/.agent/skills/cccl-build/references/docs.md b/.agent/skills/cccl-build/references/docs.md
new file mode 100644
index 00000000000..409498a6e60
--- /dev/null
+++ b/.agent/skills/cccl-build/references/docs.md
@@ -0,0 +1,21 @@
+# Documentation index — cccl-build
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cccl/development/build_and_bisect_tools.rst` | Build and bisect tool reference; preset usage, build directory layout, targeted build patterns. |
+| `CONTRIBUTING.md` | Getting started: fork, branch, devcontainer, pre-commit — the complete first-time setup path. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cccl/contributing.rst` | Repository structure, build workflow, and CI guidelines (Sphinx version of CONTRIBUTING.md). |
+| `nvrtcc/README.md` | Just-in-time CUDA compilation utility bundled with CCCL. |
+
+## See also
+
+- `cccl-test` `references/docs.md` for test-phase documentation.
+- `cccl-bisect` `references/docs.md` for bisect workflow documentation.
+- `cccl-devcontainer` `references/docs.md` for container setup before building.
diff --git a/.agent/skills/cccl-build/references/tools.md b/.agent/skills/cccl-build/references/tools.md
new file mode 100644
index 00000000000..93278d48562
--- /dev/null
+++ b/.agent/skills/cccl-build/references/tools.md
@@ -0,0 +1,31 @@
+# Tool index — cccl-build
+
+## Owned (canonical reference lives here)
+
+| Tool | Purpose | Detail |
+|------|---------|--------|
+| `ci/util/build_and_test_targets.sh` | Targeted configure/build/test driver for a single preset. Wraps cmake, ninja, ctest, lit. | `references/build_and_test_targets_usage.md` |
+| `ci/build_common.sh` | Sourced library: option parsing, compiler validation, build dir layout, helper functions for all `ci/build_*.sh` scripts. | `references/build_common.sh_usage.md` |
+| `ci/build_cub.sh` | Full-matrix CUB build: host/std/arch sweep; Launch ID (LID) partitioning for CI artifacts. | see `build_common.sh_usage.md` for common options |
+| `ci/build_thrust.sh` | Full-matrix Thrust build: host/std/arch sweep. | see `build_common.sh_usage.md` |
+| `ci/build_libcudacxx.sh` | Full-matrix libcudacxx build with lit/ctest. | see `build_common.sh_usage.md` |
+| `ci/build_cudax.sh` | Full-matrix cudax build (C++20 only). | see `build_common.sh_usage.md` |
+| `ci/build_cccl_c_parallel.sh` | Full-matrix C Parallel Library build. | see `build_common.sh_usage.md` |
+| `ci/build_cccl_c_parallel_hostjit.sh` | C Parallel hostjit variant build. | see `build_common.sh_usage.md` |
+| `ci/build_cccl_c_stf.sh` | CCCL C library STF test build. | see `build_common.sh_usage.md` |
+| `ci/build_stdpar.sh` | C++ standard parallel algorithms support build. | see `build_common.sh_usage.md` |
+| `ci/build_tidy.sh` | clang-tidy static analysis run across CCCL. | see `build_common.sh_usage.md` |
+| `ci/build_cuda_cccl_wheel.sh` | cuda-cccl Python wheel package build. | see `cccl-python` → `references/tools.md` |
+| `ci/build_cuda_cccl_python.sh` | cuda.cccl Python package in-tree dev build. | see `cccl-python` → `references/tools.md` |
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `.devcontainer/launch.sh` | Spin up or exec into a devcontainer for the build. | `cccl-devcontainer` → `references/tools.md` |
+
+## Notes
+
+`ci/build_cub.sh` accepts `-lid0`, `-lid1`, `-lid2`, `-no-lid` to select the `cub-lid0`, `cub-lid1`,
+`cub-lid2`, or `cub-nolid` CMake preset. These correspond to Launch ID (LID) partitions used in CI
+to split CUB's large test suite across multiple runners.
diff --git a/.agent/skills/cccl-c/SKILL.md b/.agent/skills/cccl-c/SKILL.md
new file mode 100644
index 00000000000..37c8e57900f
--- /dev/null
+++ b/.agent/skills/cccl-c/SKILL.md
@@ -0,0 +1,102 @@
+---
+description: |
+  Tour and orientation for the C Parallel Library (`c/` directory) — stable C ABI exposing CCCL's
+  parallel algorithms for FFI consumers. Covers dir layout, public API surface, the JIT-backed
+  wrapper pattern, test layout, and the experimental STF sublibrary.
+  Triggers: "what is cccl c", "c parallel library", "cccl c bindings", "cccl ffi", "c api".
+---
+
+# C Parallel Library
+
+The C Parallel Library is the stable-ABI C face of CCCL's parallel primitives. It ships as
+`cccl.c.parallel`, a shared library that Python (`cuda.compute`), Numba, and other language
+runtimes load via FFI. All headers require the caller to `#define CCCL_C_EXPERIMENTAL` — the
+entire surface is explicitly experimental and subject to change.
+
+## Directory layout
+
+```
+c/
+├── CMakeLists.txt              — enables parallel/ and experimental/stf/ subprojects
+├── parallel/
+│   ├── CMakeLists.txt          — builds cccl.c.parallel shared library
+│   ├── include/cccl/c/        — public C headers (one per algorithm)
+│   ├── src/                   — CUDA/C++ implementation (one .cu per algorithm)
+│   │   ├── util/              — shared context, error, type, tuning utilities
+│   │   ├── nvrtc/             — NVRTC / nvJitLink helpers
+│   │   ├── jit_templates/     — JIT type-wrapper template system (see below)
+│   │   └── hostjit/           — LLVM-backed host JIT (optional, ~20 min build)
+│   ├── test/                  — CTest-based C++ tests (one per algorithm)
+│   │   └── freestanding/      — header-isolation + bitcode tests
+│   └── cmake/                 — CParallelHeaderTesting.cmake
+└── experimental/stf/          — C bindings for the STF (Sequential Task Flow) backend
+```
+
+## Public API surface
+
+Headers live under `c/parallel/include/cccl/c/`. One header per algorithm family:
+
+| Header                   | Functions                                                          |
+|--------------------------|---------------------------------------------------------------------|
+| `types.h`                | `cccl_type_info`, `cccl_op_t`, `cccl_iterator_t`, `cccl_value_t`, enums |
+| `reduce.h`               | `cccl_device_reduce_build[_ex]`, `cccl_device_reduce[_nondeterministic]`, `_cleanup` |
+| `scan.h`                 | `cccl_device_scan_build[_ex]`, exclusive/inclusive scan variants, `_cleanup` |
+| `for.h`                  | `cccl_device_for_build[_ex]`, `cccl_device_for`, `_cleanup`        |
+| `transform.h`            | `cccl_device_transform_build[_ex]`, `cccl_device_transform`, `_cleanup` |
+| `radix_sort.h`           | `cccl_device_radix_sort_build[_ex]`, sort variants, `_cleanup`     |
+| `merge_sort.h`           | `cccl_device_merge_sort_build[_ex]`, sort variants, `_cleanup`     |
+| `segmented_reduce.h`     | segmented reduce build/run/cleanup                                 |
+| `segmented_sort.h`       | segmented sort build/run/cleanup                                   |
+| `histogram.h`            | histogram build/run/cleanup                                        |
+| `binary_search.h`        | lower/upper bound build/run/cleanup                                |
+| `three_way_partition.h`  | three-way partition build/run/cleanup                              |
+| `unique_by_key.h`        | unique-by-key build/run/cleanup                                    |
+
+Every algorithm follows the same three-call pattern: `_build` (JIT-compiles a cubin for the
+target SM), the run call itself, e.g. `cccl_device_reduce` (launches the kernel), and `_cleanup`
+(frees the cubin and library handle). Extended `_build_ex` variants accept a `cccl_build_config`
+for extra compile flags and include paths.
+
+## Wrapper pattern
+
+Each `.cu` in `src/` includes the corresponding CUB device algorithm and drives it through a
+two-stage JIT pipeline:
+
+1. `_build` calls use NVRTC + nvJitLink to compile a cubin specialized for the caller's
+   `cccl_iterator_t` and `cccl_op_t` descriptors. Operators may be provided as LTO-IR blobs
+   or as C++ source strings (`cccl_op_code_type`). The compiled cubin and `CUlibrary`/`CUkernel`
+   handles are returned in the `_build_result_t` struct.
+2. Run calls such as `cccl_device_reduce` load the pre-built cubin and launch the kernel via the CUDA driver API.
+
+The `jit_templates/` subsystem handles type-wrapper generation: it preprocesses C++ template
+headers into embedded string literals (`jit_template_header_contents`) that NVRTC receives as
+part of the compilation unit. This lets the C layer pass custom iterator and operator types
+through to CUB without a C++ ABI dependency.
+
+## Tests
+
+`c/parallel/test/` holds one `test_<algorithm>.cpp` per algorithm. Tests link against
+`cccl.c.parallel` and exercise the build/run/cleanup pattern from C++. The
+`test/freestanding/` subdirectory tests header isolation (no C++ standard library linkage)
+and bitcode paths.
+
+Build with `-DCCCL_C_Parallel_ENABLE_TESTING=ON`. Header isolation tests use
+`-DCCCL_C_Parallel_ENABLE_HEADER_TESTING=ON`.
+
+## Experimental STF sublibrary
+
+`c/experimental/stf/` exposes C bindings for CCCL's Sequential Task Flow (STF) library. Enabled via
+`-DCCCL_ENABLE_C_EXPERIMENTAL_STF=ON`. The single public header is
+`include/cccl/c/experimental/stf/stf.h`. Tests under `stf/test/` cover tasks, logical data,
+places, and CUDA kernels.
+
+## Cross-references
+
+- `cccl-python` — `cuda.compute` in `python/cuda_cccl/` is the primary consumer; it wraps
+  every algorithm in this library via `_bindings.py` and `_cccl_interop.py`.
+- `cccl-build` — build targets and preset flags for `cccl.c.parallel`.
+- `cccl-test` — CTest targets for the test suite.
+
+## Additional resources
+
+- `references/tools.md` — build and test scripts for the C Parallel Library.
diff --git a/.agent/skills/cccl-c/references/tools.md b/.agent/skills/cccl-c/references/tools.md
new file mode 100644
index 00000000000..87291121661
--- /dev/null
+++ b/.agent/skills/cccl-c/references/tools.md
@@ -0,0 +1,13 @@
+# Tool index — cccl-c
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `ci/build_cccl_c_parallel.sh` | Full-matrix C Parallel Library build. | `cccl-build` → `references/tools.md` |
+| `ci/build_cccl_c_parallel_hostjit.sh` | C Parallel hostjit variant build. | `cccl-build` → `references/tools.md` |
+| `ci/build_cccl_c_stf.sh` | CCCL C library STF test build. | `cccl-build` → `references/tools.md` |
+| `ci/test_cccl_c_parallel.sh` | C Parallel Library test. | `cccl-test` → `references/tools.md` |
+| `ci/test_cccl_c_parallel_hostjit.sh` | C Parallel hostjit variant test. | `cccl-test` → `references/tools.md` |
+| `ci/test_cccl_c_stf.sh` | CCCL C STF test. | `cccl-test` → `references/tools.md` |
+| `ci/util/build_and_test_targets.sh` | Targeted build+test for inner-loop iteration. | `cccl-build` → `references/build_and_test_targets_usage.md` |
diff --git a/.agent/skills/cccl-ci-benchmarks/SKILL.md b/.agent/skills/cccl-ci-benchmarks/SKILL.md
deleted file mode 100644
index 956eccc8798..00000000000
--- a/.agent/skills/cccl-ci-benchmarks/SKILL.md
+++ /dev/null
@@ -1,55 +0,0 @@
----
-name: cccl-ci-benchmarks
-description: "Request CCCL benchmark runs in PR CI by editing `ci/bench.yaml`, or launch benchmark workflows directly via `gh workflow run`. Walks the user through filter selection (CUB ninja-target regex / Python path regex), GPU selection, and the `[bench-only]` commit-tag convention. Use when the user wants to benchmark a change on PR CI, or trigger a one-off benchmark workflow. Trigger phrases: \"benchmark this PR\", \"request a perf run\", \"compare benchmarks before/after\"."
----
-
-# cccl-ci-benchmarks
-
-Two routes: PR-driven (edit `ci/bench.yaml`, push) and direct dispatch (`gh workflow run`).
-
-`ci/bench.yaml` holds the request; `ci/bench.template.yaml` is the empty template CI checks against. Both must
-match to merge.
-
-## Route 1 — PR-driven
-
-1. **Edit `ci/bench.yaml`:**
-   - Add CUB benchmark regexes under `benchmarks.filters.cub` (matched against ninja target names, e.g.
-     `^cub\.bench\.for_each\.base`).
-   - Add Python benchmark path regexes under `benchmarks.filters.python` (matched against paths under
-     `benchmarks/`, e.g. `compute/reduce/sum\.py`).
-   - Uncomment at least one GPU under `benchmarks.gpus`: `t4`, `rtx2080`, `rtxa6000`, `l4`, `rtx4090`, `h100`,
-     `rtxpro6000`. Pools are shared — pick conservatively.
-   - Optionally adjust `launch_args` (e.g. `"--cuda 13.2 --host gcc14"`).
-
-2. **Append `[bench-only]`** to the commit message — skips non-benchmark CI (equivalent to
-   `[skip-matrix][skip-vdc][skip-docs][skip-tpt]`).
-
-3. **Push.** Inspect dispatched jobs via `gh run view <RUN_ID>`.
-
-4. **Reset before final merge.** Restore `ci/bench.yaml` to match `ci/bench.template.yaml` (empty filters, no GPUs
-   uncommented).
-
-## Route 2 — direct dispatch
-
-If a benchmark workflow exists for direct dispatch (`gh workflow list --repo NVIDIA/cccl`):
-
-```
-gh workflow run <workflow-name>.yml --repo NVIDIA/cccl --ref <branch> -f <input>=<value>
-```
-
-Return the run URL. `gh workflow run` is mutating; prompts every use.
-
-## Defaults
-
-From `ci/bench.yaml`'s `Advanced` block:
-
-- `base_ref: "origin/main"` — what to compare against.
-- `test_ref: "HEAD"` — what to test.
-- `arch: "native"` — usually fine; can be a list like `"80;90"`.
-- `nvbench_args` — preset with timeout / skip-time / stopping criterion / throttle handling.
-
-## Pitfalls
-
-- Forgetting to uncomment a GPU → no jobs run.
-- Forgetting `[bench-only]` → wasteful full-CI run alongside.
-- Not resetting `ci/bench.yaml` before merge → merge blocked.
diff --git a/.agent/skills/cccl-ci/SKILL.md b/.agent/skills/cccl-ci/SKILL.md
index 4fbfd5faa5e..f0503df8cd6 100644
--- a/.agent/skills/cccl-ci/SKILL.md
+++ b/.agent/skills/cccl-ci/SKILL.md
@@ -1,35 +1,32 @@
 ---
-name: cccl-ci
-description: "Orientation for CCCL's GitHub Actions CI. Pointers to the sources of truth (`ci/matrix.yaml`, `ci-overview.md`, workflow files) and a map of the moving parts. Use when the user asks how CI works here, where a CI behavior is defined, why a job ran or didn't, or what `[skip-*]` tags exist. Trigger phrases: \"how does CI work\", \"where is X CI defined\", \"why did this job run\", \"explain the matrix\". For TRIAGING a CI failure, use `cccl-triage-pr` or `cccl-triage-nightly` instead."
+description: "Orientation for CCCL's GitHub Actions CI: sources of truth, PR run flow, skip tags, override matrix, /ok to test policy, and agent dispatch map. For diagnosing failures, route to cccl-triage instead. Triggers: \"how does CI work\", \"where is X CI defined\", \"why did this job run\", \"explain the matrix\", \"scope this PR's CI\"."
 ---
 
 # cccl-ci
 
+Sources of truth, flow, and the two mechanisms that scope a PR's CI.
+
 ## Sources of truth
 
-| Topic                                         | File                                                              |
-|-----------------------------------------------|-------------------------------------------------------------------|
-| Job matrix (PR / nightly / weekly + override) | `ci/matrix.yaml`                                                  |
-| Skip tags, override rules, troubleshooting    | `ci-overview.md`                                                  |
-| Workflow entry points                         | `.github/workflows/ci-workflow-{pull-request,nightly,weekly}.yml` |
-| `/ok to test` policy + trustees               | `.github/copy-pr-bot.yaml`, `CONTRIBUTING.md` § CI                |
-| Per-job runner setup                          | `.github/actions/workflow-run-job-{linux,windows}/`               |
-| Matrix expansion → dispatchable jobs          | `.github/actions/workflow-build/` running `build-workflow.py`     |
-| Job pruning by changed paths                  | `ci/inspect_changes.py`                                           |
-| Result aggregation                            | `.github/actions/workflow-results/`                               |
-| Bench-request config                          | `ci/bench.yaml`                                                   |
-| Git-bisect cloud dispatch                     | `.github/workflows/git-bisect.yml`                                |
+| Topic | File |
+|-------|------|
+| Job matrix (PR / nightly / weekly + override) | `ci/matrix.yaml` |
+| Skip tags, override rules, troubleshooting | `ci-overview.md` |
+| Workflow entry points | `.github/workflows/ci-workflow-{pull-request,nightly,weekly}.yml` |
+| Per-job runner setup | `.github/actions/workflow-run-job-{linux,windows}/` |
+| Matrix expansion → dispatchable jobs | `.github/actions/workflow-build/` running `build-workflow.py` |
+| Job pruning by changed paths | `ci/inspect_changes.py` |
+| Result aggregation | `.github/actions/workflow-results/` |
+| Bench-request config | `ci/bench.yaml` |
+| Git-bisect cloud dispatch | `.github/workflows/git-bisect.yml` |
 
 ## PR run flow
 
-`ci-workflow-pull-request.yml` → `build-workflow.py` reads `ci/matrix.yaml`. Non-empty `workflows.override` wins;
-otherwise `inspect_changes.py` prunes by dirty projects from changed paths. Jobs run through
-`workflow-run-job-{linux,windows}/` in a devcontainer. `workflow-results/` aggregates; marks failed if any job
-failed OR if override is non-empty.
+`ci-workflow-pull-request.yml` → `build-workflow.py` reads `ci/matrix.yaml`. Non-empty `workflows.override` wins; otherwise `inspect_changes.py` prunes by dirty projects from changed paths. Jobs run in a devcontainer via `workflow-run-job-{linux,windows}/`. `workflow-results/` aggregates; marks failed if any job failed OR if override is non-empty.
 
 ## Scoping a PR's CI (both block merging)
 
-- **`[skip-*]` tags** on the last commit. Tokens in `ci-overview.md`.
+- **`[skip-*]` tags** on the last commit — tokens in `ci-overview.md`.
 - **`workflows.override` in `ci/matrix.yaml`** — replaces the `pull_request` matrix with a targeted subset:
 
   ```yaml
@@ -40,15 +37,32 @@ failed OR if override is non-empty.
 
 `cccl-ci-overrides` generates both from failed-job names and/or changed-path lists.
 
-## `/ok to test` policy
+## /ok to test policy
+
+Draft PRs need `/ok to test <SHA>` from a maintainer to start CI. Route all such requests through `cccl-pr`.
+
+## Agents
+
+| Agent                       | Model  | Purpose                                                                |
+|-----------------------------|--------|------------------------------------------------------------------------|
+| `cccl-ci-overrides`         | sonnet | Generate `workflows.override` entries and/or `[skip-*]` tags from job names and changed paths |
+| `cccl-ci-fetch-failures`    | haiku  | Fetch and list failed jobs for a PR or run                             |
+| `cccl-ci-summarize-job-log` | haiku  | Fetch a single job's log and return a structured failure summary       |
+
+## Benchmarks
+
+CI-side benchmark requests are outside this skill's scope. Use `cccl-bench` for writing benchmarks, running the `cccl.bench` tuning harness, and requesting CI bench runs via `ci/bench.yaml`.
+
+## Additional resources
 
-Draft PRs need `/ok to test <SHA>` from a maintainer to start CI. Route all such requests through the
-`cccl-ok-to-test` agent (SHA-gated).
+- `references/docs.md` — index of CCCL CI documentation.
+- `references/tools.md` — CI-internal scripts with purpose and cross-references.
 
 ## Gotchas
 
 - Non-empty `workflows.override` blocks merge. Reset to empty before final merge (don't remove the key).
-- Any `[skip-*]` tag blocks merge.
+- Any `[skip-*]` tag on the last commit blocks merge.
 - `ci/bench.yaml` must match `ci/bench.template.yaml` to merge.
-- `gh pr view --json statusCheckRollup` returns 100k+ tokens for 500-job PRs. Use `gh pr checks`.
+- `gh pr view --json statusCheckRollup` returns 100k+ tokens for 500-job PRs. Use `gh pr checks` instead.
 - `gh run view --log-failed` errors mid-run. Use `gh api repos/NVIDIA/cccl/actions/jobs/<JID>/logs`.
+- `gh api --paginate` on an array-returning list endpoint emits one JSON array per page; pipe through `jq -s 'add'` to merge the pages before processing.
diff --git a/.agent/skills/cccl-ci/references/docs.md b/.agent/skills/cccl-ci/references/docs.md
new file mode 100644
index 00000000000..43536d894da
--- /dev/null
+++ b/.agent/skills/cccl-ci/references/docs.md
@@ -0,0 +1,19 @@
+# Documentation index — cccl-ci
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `ci-overview.md` | CI environment, matrix.yaml structure, skip tags, override matrix, `/ok to test` policy, and troubleshooting commands. The authoritative user-facing CI reference. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cccl/contributing.rst` | Repository structure, build workflow, testing, and CI guidelines (Sphinx version of CONTRIBUTING.md). |
+
+## See also
+
+- `cccl_detail-ci` `references/docs.md` for the same docs from the CI-internals perspective.
+- `cccl_detail-ci` `references/inspect-changes.md` for the dependency graph behind project scoping.
+- `cccl_detail-ci` `references/matrix-expansion.md` for `build-workflow.py` internals.
diff --git a/.agent/skills/cccl-ci/references/tools.md b/.agent/skills/cccl-ci/references/tools.md
new file mode 100644
index 00000000000..dd1e789017c
--- /dev/null
+++ b/.agent/skills/cccl-ci/references/tools.md
@@ -0,0 +1,24 @@
+# Tool index — cccl-ci
+
+## Owned (CI-internal; not user-invoked directly)
+
+These scripts run inside GitHub Actions jobs and are not meant for direct use. They are documented here for diagnostic and maintenance purposes.
+
+| Tool | Purpose |
+|------|---------|
+| `ci/run_gpu_target.sh` | Entry point for GPU CI jobs: sets environment, launches devcontainer, evaluates the job command, uploads results. |
+| `ci/run_cpu_target.sh` | Entry point for CPU-only CI jobs (e.g. static analysis, packaging tests). |
+| `ci/run_gpu_bisect.sh` | Entry point for GPU bisect jobs dispatched from `.github/workflows/git-bisect.yml`. |
+| `ci/run_cpu_bisect.sh` | Entry point for CPU bisect jobs. |
+| `ci/pretty_printing.sh` | Sourced by build/test scripts: colorized `begin_group`/`end_group` banners, `run_command` with retry, `print_var_values`. |
+| `ci/upload_cub_test_artifacts.sh` | Packages and uploads CUB test artifacts (binaries + metadata) for multi-runner test jobs. |
+| `ci/upload_thrust_test_artifacts.sh` | Packages and uploads Thrust test artifacts. |
+| `ci/upload_job_result_artifacts.sh` | Writes a success/fail record for the `workflow-results` aggregation step. Called unconditionally at job end. |
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `ci/inspect_changes.py` | Classifies dirty projects from changed paths; drives job pruning. | `cccl_detail-ci` → `references/inspect_changes_usage.md` |
+| `ci/util/build_and_test_targets.sh` | Build+test driver called inside CI job containers. | `cccl-build` → `references/build_and_test_targets_usage.md` |
+| `ci/util/git_bisect.sh` | Automated bisect invoked by `run_gpu_bisect.sh`. | `cccl-bisect` → `references/git_bisect_usage.md` |
diff --git a/.agent/skills/cccl-clarify/SKILL.md b/.agent/skills/cccl-clarify/SKILL.md
index 8a6b25ee1c7..2d28d8abece 100644
--- a/.agent/skills/cccl-clarify/SKILL.md
+++ b/.agent/skills/cccl-clarify/SKILL.md
@@ -1,43 +1,39 @@
 ---
-name: cccl-clarify
-description: "Decision-point escalation. Use when you cannot resolve a question through default reasoning — tricky tradeoffs, scarce evidence, ambiguous user intent, or a fork in the road that needs human judgment. Triggered by phrases like \"I'm stuck\", \"not sure how to proceed\", \"should I X or Y\", \"help me decide\". Also invoked by other cccl-* skills when they need to surface a question to the user. Walks the three-step escalation (default reasoning → self-research → ask the user) and the \"how to ask well\" rules — print context in chat, AskUserQuestion with breakdown branch, point-by-point dialogue."
+description: "Decision-point escalation when default reasoning cannot resolve a question — tricky tradeoffs, scarce evidence, ambiguous intent, or a hard-to-reverse fork. Other cccl-* skills route user-question moments here. Triggers: \"I'm stuck\", \"should I X or Y\", \"help me decide\", \"not sure how to proceed\"."
 ---
 
 # cccl-clarify
 
-## Escalation ladder
+Surfaces decisions that default reasoning cannot confidently close. Three-step ladder; stop at the first level that produces a confident answer.
 
-Stop at the first level that produces a confident answer.
+## Step 1 — Default reasoning
 
-1. **Default reasoning** — resolve from existing context: prompt, conversation, files read, `AGENTS.md`, `cccl`
-   skill, memory. Escalate if the tradeoffs are balanced, evidence is thin, the decision is hard to reverse, or
-   intent is genuinely ambiguous.
-2. **Self-research** — cheapest source first: code, memory, in-repo docs (`AGENTS.md`, `CONTRIBUTING.md`,
-   `ci-overview.md`), upstream library docs, web, Explore subagent. Time-box. Two or three rounds without
-   confidence moving = escalate.
-3. **Ask the user** — when research won't close the gap.
+Resolve from existing context: prompt, conversation, files read, `AGENTS.md`, `cccl` skill, memory. Escalate if tradeoffs are balanced, evidence is thin, the decision is hard to reverse, or intent is genuinely ambiguous.
 
-## How to ask well
+## Step 2 — Self-research
 
-1. **Print context in chat.** Tool output isn't visible to the user. Frame the decision, what was tried, the
-   tradeoff axis — in your text, not just in the question prompt.
-2. **`AskUserQuestion` correctly.** 2–4 mutually-exclusive options (or `multiSelect`). Lead with the recommendation
-   and suffix `(Recommended)` when evidence favours it. Each option's `description` carries the substance. Don't
-   add "Other" — UI handles it.
-3. **Offer a breakdown branch** for non-trivial questions — a "walk me through it" option that lets the user defer
-   the pick.
-4. **Breakdown flow.** Offer further research (multi-select with "None — overview"). Then a 200–400 word overview:
-   problem, ordered decision points, tradeoffs, what's already decided. Walk point-by-point — dependent questions
-   sequential, not parallel. Confirm the chosen path end-to-end before acting.
+Cheapest source first: code, memory, in-repo docs (`AGENTS.md`, `CONTRIBUTING.md`, `ci-overview.md`), upstream library docs, web, Explore subagent. Time-box. Two or three rounds without confidence moving = escalate.
+
+## Step 3 — Ask the user
+
+When research won't close the gap:
+
+1. **Print context in chat.** Tool output isn't visible to the user. Frame the decision, what was tried, the tradeoff axis — in your text, not in the question prompt.
+2. **`AskUserQuestion` correctly.** 2–4 mutually-exclusive options (or `multiSelect`). Lead with the recommendation and suffix `(Recommended)` when evidence favours it. Each option's `description` carries the substance. Don't add "Other" — UI handles it.
+3. **Offer a breakdown branch** for non-trivial questions — a "walk me through it" option that lets the user defer the pick. See `references/breakdown-flow.md` for the full walkthrough protocol.
 
 ## When NOT to invoke
 
 - Single-line obvious fixes.
 - Conversational questions — answer them.
-- Decisions whose default is so obvious that asking is noise.
+- Decisions whose default is obvious enough that asking is noise.
 - Questions answered in `AGENTS.md`, the `cccl` skill, or memory.
 
 ## Hard prohibitions
 
 - Never invoke recursively.
 - Never use to defer a decision the user already made.
+
+## Additional resources
+
+- `references/breakdown-flow.md` — full breakdown branch walkthrough: research phase, overview format, point-by-point sequencing, confirmation step.
diff --git a/.agent/skills/cccl-clarify/references/breakdown-flow.md b/.agent/skills/cccl-clarify/references/breakdown-flow.md
new file mode 100644
index 00000000000..2844b9ba371
--- /dev/null
+++ b/.agent/skills/cccl-clarify/references/breakdown-flow.md
@@ -0,0 +1,43 @@
+# Breakdown flow
+
+Used when the user selects the "walk me through it" option on a non-trivial `AskUserQuestion`. Four phases, executed in order.
+
+## Phase 1 — Further research
+
+Offer a multi-select list of research directions relevant to the question. Include a "None — proceed to overview" option. Execute selected directions before continuing.
+
+## Phase 2 — Overview
+
+Write a 200–400 word summary covering:
+
+- **Problem statement** — what needs to be decided and why it matters.
+- **Ordered decision points** — the sequence of choices, not a flat list.
+- **Tradeoffs** — what each option gains and costs; cite specific files or repo facts where available.
+- **What's already settled** — constraints that are not up for debate.
+
+Keep the overview factual. No recommendations yet — the goal is shared understanding before the user commits to anything.
+
+## Phase 3 — Point-by-point walk
+
+Work through each decision point from the overview in sequence.
+
+- Present one question at a time via `AskUserQuestion`.
+- Dependent questions wait for their prerequisite answer before being posed — never run them in parallel.
+- After each answer, summarize the implication briefly in chat before moving to the next point.
+- If an answer makes a later decision point moot, skip it and say so.
+
+## Phase 4 — Confirm chosen path
+
+After all decision points are resolved:
+
+1. Print a concise summary of the full chosen path: each decision point and the selected answer, in order.
+2. Ask the user to confirm before acting.
+3. On confirmation, hand off to the calling skill or proceed with the action.
+
+If the user changes an earlier answer during confirmation, re-walk only the affected downstream points.
+
+## Constraints
+
+- Never skip Phase 4 — acting without confirmation violates the breakdown contract.
+- Keep each `AskUserQuestion` focused on one decision; don't bundle multiple questions into one prompt.
+- The breakdown branch is for non-trivial forks only. Single-question decisions do not need a breakdown.
diff --git a/.agent/skills/cccl-cmake/SKILL.md b/.agent/skills/cccl-cmake/SKILL.md
new file mode 100644
index 00000000000..b30973f9958
--- /dev/null
+++ b/.agent/skills/cccl-cmake/SKILL.md
@@ -0,0 +1,109 @@
+---
+description: |
+  CCCL's CMake configuration system — presets, per-library enable flags, architecture
+  values, and non-preset builds. Covers what presets exist, how to use them, which
+  options to toggle for local dev, and where helper modules live.
+  Triggers: "cmake presets", "configure cccl", "what presets are available",
+  "non-preset build", "list cmake options".
+---
+
+# CMake
+
+Reference and orientation for CCCL's CMake configuration layer. Push cmake module
+internals, custom-command definitions, and arch-flag mechanics to `cccl_detail-cmake`.
+
+## Presets
+
+`CMakePresets.json` at the repo root. List all user-visible presets:
+
+```
+cmake --list-presets
+```
+
+Configure with a preset (Ninja generator, build dir set automatically):
+
+```
+cmake --preset <name>
+```
+
+Build dir lands at `build/$CCCL_BUILD_INFIX/<preset-name>/` relative to the source root.
+
+Key presets:
+
+| Preset                        | Purpose |
+|-------------------------------|---------|
+| `all-dev`                     | All libraries, tests, examples — native arch only. Start here for local dev. |
+| `all-dev-debug`               | Same as `all-dev`, Debug build type, device-side debug (`-G`). |
+| `all-tidy`                    | clang-tidy run, C++17, clang as host and CUDA compiler. |
+| `libcudacxx`, `cub`, `thrust`, `cudax` | Single-library dev with tests. |
+| `libcudacxx-cpp17/20`, `cub-cpp17/20` | Per-library with explicit C++ standard. |
+| `install`, `install-unstable` | Packaging — only stable (or stable+experimental) libs. |
+
+Each per-library preset also has `-cpp17` and `-cpp20` variants. CUB has additional
+launcher-configuration variants (`cub-nolid`, `cub-lid0`, etc.).
+
+## Key CMake options
+
+Toggle these via `-D` on the command line or in a preset:
+
+| Option                        | Default           | Effect                                               |
+|-------------------------------|-------------------|------------------------------------------------------|
+| `CCCL_ENABLE_LIBCUDACXX`      | OFF               | libcudacxx developer build                           |
+| `CCCL_ENABLE_CUB`             | OFF               | CUB developer build                                  |
+| `CCCL_ENABLE_THRUST`          | OFF               | Thrust developer build                               |
+| `CCCL_ENABLE_CUDAX`           | OFF               | cudax developer build (requires `CCCL_ENABLE_UNSTABLE`) |
+| `CCCL_ENABLE_UNSTABLE`        | OFF               | Gate for experimental/unstable targets               |
+| `CCCL_ENABLE_TESTING`         | OFF               | Top-level test targets                               |
+| `CCCL_ENABLE_EXAMPLES`        | OFF               | Example targets                                      |
+| `CCCL_ENABLE_BENCHMARKS`      | OFF               | NVBench benchmark targets (not available with NVHPC) |
+| `CCCL_ENABLE_C_PARALLEL`      | OFF               | C Parallel library                                   |
+| `CCCL_ENABLE_CLANG_TIDY`      | OFF               | clang-tidy integration                               |
+| `CMAKE_CUDA_ARCHITECTURES`   | `all-major-cccl` | GPU arch list (see below)                            |
+
+## CMAKE_CUDA_ARCHITECTURES
+
+CCCL defines two custom values, `all-major-cccl` and `all-cccl`, beyond the standard CMake ones such as `native`:
+
+- `all-major-cccl` — all major architectures supported by the current CTK, filtered to ≥ sm_75. Default in presets.
+- `all-cccl` — all architectures (including minor variants) ≥ sm_75.
+- `native` — detect the GPU in the build machine. Used by `all-dev`.
+
+For an explicit list: `-DCMAKE_CUDA_ARCHITECTURES="80-real;90-real;90-virtual"`.
+
+## Non-preset build
+
+Without presets, configure manually:
+
+```
+cmake -B build \
+  -DCMAKE_CUDA_ARCHITECTURES=native \
+  -DCCCL_ENABLE_CUB=ON \
+  -DCCCL_ENABLE_TESTING=ON
+```
+
+Minimum CMake version for dev builds: 3.21. For embedding via `add_subdirectory`: 3.18.
+
+## Helper modules
+
+All helpers live in `cmake/`. Notable files:
+
+| File                            | Role                                                 |
+|---------------------------------|------------------------------------------------------|
+| `CCCLCheckCudaArchitectures.cmake` | Resolves `all-cccl` / `all-major-cccl` to concrete arch lists |
+| `CCCLDevBuildChecks.cmake`      | Validation checks for top-level dev builds           |
+| `CCCLAddSubdirHelper.cmake`     | Support for `add_subdirectory()` embedding           |
+| `CCCLInstallRules.cmake`        | Install / packaging rules                            |
+| `CCCLGetDependencies.cmake`     | Dependency fetch via CPM                             |
+| `CCCLGenerateHeaderTests.cmake` | Header include-test generation                       |
+| `CCCLConfigureTarget.cmake`     | Target configuration helpers                         |
+| `CCCLUtilities.cmake`           | Common utilities (included first)                     |
+
+Internals of these modules — custom command definitions, `cccl_add_compile_test`, per-lib
+`CMakeLists.txt` structure — are covered in `cccl_detail-cmake`.
+
+## See also
+
+- `cccl-build` — `ci/util/build_and_test_targets.sh`, the recommended driver for targeted
+  local builds and tests. Prefer it over direct `cmake --build` for CI-like iteration.
+- `cccl_detail-cmake` — module internals, `cccl_add_compile_test` and related custom commands,
+  per-library `CMakeLists.txt` walkthrough, arch-flag deep mechanics.
diff --git a/.agent/skills/cccl-commit/SKILL.md b/.agent/skills/cccl-commit/SKILL.md
index 388186ee2de..8ec3f1a997a 100644
--- a/.agent/skills/cccl-commit/SKILL.md
+++ b/.agent/skills/cccl-commit/SKILL.md
@@ -1,20 +1,19 @@
 ---
-name: cccl-commit
-description: "Walk uncommitted changes in a CCCL worktree through an interactive review-and-stage flow: survey the diff, optionally split into multiple commit groups, walk chunks one at a time with diff rendering and an action menu (stage / edit / defer / revert), optionally run a test gate, draft commit message(s), confirm, and commit. Use when committing uncommitted changes, preparing a branch for push, or wrapping up a fix. Trigger phrases: \"commit these changes\", \"wrap this up\", \"ready to commit\", \"stage and commit\", \"prepare commits\", \"split into commits\". For PR creation or `/ok to test`, route to `cccl-pr` after committing."
+description: "Interactive commit prep — survey changes, split into commit groups, walk chunks with a diff render and action menu, run a test gate, draft commit messages, and commit. Refuses on `main`. Triggers: \"commit these changes\", \"wrap this up\", \"ready to commit\", \"stage and commit\", \"split into commits\"."
 ---
 
 # cccl-commit
 
-Interactive commit prep. Route every user-facing question through `cccl-clarify`. Refuses on `main`. Scratch dir:
+Interactive commit prep. Route every user-facing question through `cccl-clarify`. Refuses on `main`. Scratch:
 `mkdir -p /tmp/claude/<sessionid>`.
 
 ## Step 1 — Component selection
 
-`AskUserQuestion`, `multiSelect: true`:
+`cccl-clarify`, `multiSelect: true`:
 
 - **Split** — group hunks into multiple commits.
-- **Interactive** — walk each chunk with a diff render + action menu.
-- **Test gate** — run `pre-commit` and a build/test target before committing.
+- **Interactive** — walk each chunk with a diff render and action menu.
+- **Test gate** — run `pre-commit` and/or a build/test target before committing.
 - **Commit** — write messages and execute. Without this, nothing commits.
 
 Commit-only with no Split / no Interactive → fast path: commit whatever is staged (Step 5).
@@ -32,92 +31,64 @@ Single Bash each:
 
 `git diff > /tmp/claude/<sessionid>/patch.txt` (or `git diff HEAD` for combined).
 
-Plan into commit groups CC-NN (one group if Split not selected). Within each group, slice into chunks; write each
-slice to `/tmp/claude/<sessionid>/chunks/CC-NN.patch`. Coverage check: sum-of-slice-hunks == total-hunks. Run
-`git apply --check chunks/CC-NN.patch` on every slice.
+Plan into commit groups CC-NN. Within each group, slice into chunks; write each slice to
+`/tmp/claude/<sessionid>/chunks/CC-NN.patch`. Coverage check: sum-of-slice-hunks == total-hunks.
+Run `git apply --check chunks/CC-NN.patch` on every slice.
 
 Present plan summary (groups, chunks/group, total lines). `cccl-clarify` → approve / reorder / discuss.
 
 ## Step 4 — Walk chunks (if Interactive)
 
-For each chunk in planned order:
+See `references/walkthrough-rules.md` for diff display rules, the action menu, and tracking.
 
-1. Read `chunks/CC-NN.patch`.
-2. Render the diff verbatim in chat as a ` ```diff ` fenced block, per-hunk headers naming file:line range.
-   Never use Bash output for diffs. Pattern dedup is fine for repetition — show pattern once, list other
-   occurrences and locations.
-3. Suggest improvements (numbered, with file:line refs) or note "No suggested changes".
-4. `AskUserQuestion`:
-   - **Stage as-is** — `git apply --cached chunks/CC-NN.patch`. Verify with `git diff --cached --stat`; STOP if
-     the staged file list doesn't match the expected set.
-   - **Apply suggested edits, re-review** — `Edit`, regenerate diff with `git diff -- <files>`, loop.
-   - **Apply custom edits, re-review** — user describes, `Edit`, loop.
-   - **Leave unstaged** — defer.
-   - **Revert** — `git apply -R chunks/CC-NN.patch` (or `git checkout -- <file>` for whole-file).
-   - **Discuss** — open conversation; loop.
-
-Track: current group, staged/deferred/reverted chunks.
-
-Split selected, Interactive not → auto-stage each slice in order. Verify the staged set grows monotonically into
-the per-group expected set. STOP on divergence.
+Split selected, Interactive not → auto-stage each slice in order. Verify the staged set grows
+monotonically into the per-group expected set. STOP on divergence.
 
 ## Step 5 — Test gate (if selected) + commit
 
 ### 5.0 Fast path
 
-Commit-only with no Split / no Interactive: confirm staged set via `git diff --cached --stat` (empty → exit),
-skip the test gate unless asked, go to 5.2.
+No Split / no Interactive: confirm staged set via `git diff --cached --stat` (empty → exit),
+skip test gate unless asked, go to 5.2.
 
 ### 5.0a Optional CI scoping (last commit only)
 
-Before drafting the last commit's message, route through `cccl-clarify`: offer to scope the next CI run via
-`cccl-ci-overrides` — override matrix (writes `workflows.override` into `ci/matrix.yaml`; re-stage + re-run
-pre-commit) and/or `[skip-*]` tags on the last commit's last line. Both block merge — remind the user to reset
-before final merge.
+Before drafting the last commit message, use `cccl-clarify` to offer scoping the next CI run
+via `cccl-ci-overrides`: `workflows.override` in `ci/matrix.yaml` and/or `[skip-*]` tags.
+Both block merge; remind the user to reset them before final merge.
 
 ### 5.1 Tests
 
-`cccl-clarify` → skip / `pre-commit run --files <staged>` / dispatch `cccl-build-and-test-targets`. If
-`pre-commit` is absent, venv-install it (`python3 -m venv .venv && .venv/bin/pip install pre-commit`).
-
-Many pre-commit hooks auto-fix in place (`pretty-format-json`, `end-of-file-fixer`,
-`trim-trailing-whitespace`, `ruff format`). On failure with auto-fixes applied:
-1. Show the resulting `git diff` per fixed file.
-2. For each file, route through `cccl-clarify` — re-stage / revert / discuss — same flow as Step 4's per-chunk
-   action menu. Never bulk-`git add` the fixes.
-3. Re-run `pre-commit run --files <staged>` to confirm clean.
-
-Other failures: investigate / commit anyway / abort via `cccl-clarify`.
+`cccl-clarify` → skip / `pre-commit run --files <staged>` / dispatch `cccl-test` (or `cccl-build` if a build is needed first).
+See `references/pre-commit-autofix.md` for the auto-fix / re-stage flow and edge cases.
 
 ### 5.2 Commit message
 
 `cccl-clarify` for detail tier — **Trivial** (subject only) / **Standard** (subject + 1–6 body lines) /
-**Detailed** (subject + multi-paragraph).
-
-Rules:
-- Subject ≤ 72 chars, imperative, no trailing period.
-- Match CCCL's prefix convention from `git log --oneline -20`.
-- Body wraps ~72 chars.
-- No co-author / tool-attribution footers.
-- `[skip-*]` tags apply to a single push and must appear on the LAST commit's last line only.
+**Detailed** (subject + multi-paragraph). See `references/commit-message-rules.md` for conventions.
 
 Draft. `cccl-clarify` → use / revise / cancel.
 
 ### 5.3 Commit
 
-Write final message to `/tmp/claude/<sessionid>/commit-msg-CC.txt`. Then `git commit -F <path>` (mutating; expect
-prompt). Verify with `git show -p HEAD`: SHA, subject, file list match expectations.
+Write final message to `/tmp/claude/<sessionid>/commit-msg-CC.txt`. Then `git commit -F <path>`.
+Verify with `git show -p HEAD`: SHA, subject, file list match expectations.
 
 ## Step 6 — Inter-group transition (if Split)
 
-After each commit, `cccl-clarify` → continue / pause / end. On continue, verify remaining slices still apply
-(`git apply --check` per remaining slice); regenerate the patch and re-plan if any fail.
+After each commit, `cccl-clarify` → continue / pause / end. On continue, verify remaining slices
+still apply (`git apply --check` per remaining slice); regenerate and re-plan if any fail.
 
-Last group → final summary (all SHAs, deferred, reverted) and exit. (CI scoping was offered in Step 5.0a.)
+Last group → final summary (all SHAs, deferred, reverted) and exit.
+
+## Good-enough criterion
+
+All selected commit groups have landed with verified SHAs; no deferred chunks remain unless the user
+explicitly left them.
 
 ## Hard prohibitions
 
-Unless explicitly approved by the user in `cccl-clarify` at the moment of action, never do any of the following:
+Unless explicitly approved by the user in `cccl-clarify` at the moment of action:
 
 - Never edit on `main`.
 - Never `--no-verify`.
@@ -132,3 +103,9 @@ In any circumstance:
 ## Handoff
 
 After commits land: route to `cccl-pr` for push / open / update / `/ok to test`.
+
+## Additional resources
+
+- `references/walkthrough-rules.md` — per-chunk diff display, action menu, tracking.
+- `references/pre-commit-autofix.md` — pre-commit failure, auto-fix detection, re-stage flow.
+- `references/commit-message-rules.md` — subject / body conventions, tag prefixes, prohibited content.
diff --git a/.agent/skills/cccl-commit/references/commit-message-rules.md b/.agent/skills/cccl-commit/references/commit-message-rules.md
new file mode 100644
index 00000000000..cdec4af55c0
--- /dev/null
+++ b/.agent/skills/cccl-commit/references/commit-message-rules.md
@@ -0,0 +1,47 @@
+# Commit message rules
+
+Used by `cccl-commit` Step 5.2.
+
+## Subject line
+
+- 72 characters maximum.
+- Imperative mood: "Add X", "Fix Y", not "Added X" or "Fixes Y".
+- No trailing period.
+- Match CCCL's prefix convention — inspect `git log --oneline -20` before drafting.
+
+Common prefix patterns (verify against log; do not invent):
+
+```
+[libcudacxx] ...
+[cub] ...
+[thrust] ...
+[cudax] ...
+[ci] ...
+[docs] ...
+```
+
+## Body
+
+- Wrap lines at ~72 characters.
+- Separate from subject with one blank line.
+- Explain what changed and why; omit what is obvious from the diff.
+- No story paragraphs ("Surfaced while debugging …", "Found during …").
+
+## Detail tiers
+
+| Tier     | When                              | Content                               |
+|----------|-----------------------------------|---------------------------------------|
+| Trivial  | Mechanical change, obvious from diff | Subject only                        |
+| Standard | Most commits                      | Subject + 1–6 body lines              |
+| Detailed | Complex change, non-obvious rationale | Subject + multi-paragraph body      |
+
+## Skip tags
+
+`[skip-*]` tags scope a single CI push and belong only on the **last commit's last line**.
+They block merge if left in place — remind the user to remove them before final merge.
+
+## Prohibited content
+
+- No co-author lines (`Co-authored-by:`, `Co-Authored-By:`).
+- No tool-attribution footers ("Generated with …", "AI-assisted").
+- No marketing adjectives ("powerful", "robust", "comprehensive").
diff --git a/.agent/skills/cccl-commit/references/pre-commit-autofix.md b/.agent/skills/cccl-commit/references/pre-commit-autofix.md
new file mode 100644
index 00000000000..5cc22f98839
--- /dev/null
+++ b/.agent/skills/cccl-commit/references/pre-commit-autofix.md
@@ -0,0 +1,44 @@
+# Pre-commit autofix flow
+
+Used by `cccl-commit` Step 5.1.
+
+## Install if absent
+
+If `pre-commit` is not on `PATH`, install it into a local venv:
+
+```
+python3 -m venv .venv
+.venv/bin/pip install pre-commit
+```
+
+Then run `.venv/bin/pre-commit run --files <staged>`.
+
+## Auto-fixing hooks
+
+Several hooks modify files in place on failure:
+
+- `pretty-format-json`
+- `end-of-file-fixer`
+- `trim-trailing-whitespace`
+- `ruff format`
+
+When `pre-commit` exits non-zero and the working tree has changed, treat it as an auto-fix run.
+
+## Auto-fix / re-stage flow
+
+1. Show the resulting `git diff` for each modified file.
+2. For each file, route through `cccl-clarify`:
+   - **Re-stage** — `git apply --cached` the per-file diff.
+   - **Revert** — `git checkout -- <file>`.
+   - **Discuss** — open conversation; loop.
+   Never bulk-`git add` the fixes.
+3. Re-run `pre-commit run --files <staged>` to confirm clean.
+
+## Non-auto-fix failures
+
+Hooks that report errors without modifying files (type-checking, lint violations, custom validators):
+
+`cccl-clarify` → investigate and fix / commit anyway / abort.
+
+"Commit anyway" is only appropriate for failures the user understands and accepts; never
+suppress with `--no-verify` without explicit user approval at the moment of action.
diff --git a/.agent/skills/cccl-commit/references/walkthrough-rules.md b/.agent/skills/cccl-commit/references/walkthrough-rules.md
new file mode 100644
index 00000000000..189a71e4eeb
--- /dev/null
+++ b/.agent/skills/cccl-commit/references/walkthrough-rules.md
@@ -0,0 +1,37 @@
+# Walkthrough rules
+
+Used by `cccl-commit` Step 4.
+
+## Per-chunk diff display
+
+For each chunk in planned order:
+
+1. Read `chunks/CC-NN.patch`.
+2. Render the diff verbatim in chat as a ` ```diff ` fenced block. Per-hunk headers name the
+   file and line range. Never use Bash output for diffs.
+3. Pattern dedup is allowed: show a repeated pattern once, then list all other
+   occurrences with file:line refs.
+4. Suggest improvements (numbered, with file:line refs) or note "No suggested changes".
+
+## Action menu
+
+Present via `cccl-clarify`:
+
+- **Stage as-is** — `git apply --cached chunks/CC-NN.patch`. Verify with
+  `git diff --cached --stat`; STOP if the staged file list doesn't match the expected set.
+- **Apply suggested edits, re-review** — `Edit`, regenerate diff with `git diff -- <files>`, loop.
+- **Apply custom edits, re-review** — user describes changes, `Edit`, loop.
+- **Leave unstaged** — defer; move to next chunk.
+- **Revert** — `git apply -R chunks/CC-NN.patch` (or `git checkout -- <file>` for whole-file).
+- **Discuss** — open conversation; loop back to the action menu when resolved.
+
+## Tracking
+
+Maintain per-group state:
+
+- Current group identifier (CC-NN).
+- List of staged chunks.
+- List of deferred chunks.
+- List of reverted chunks.
+
+Report the state summary at the end of each group before proceeding to Step 5.
diff --git a/.agent/skills/cccl-cpp-builds/SKILL.md b/.agent/skills/cccl-cpp-builds/SKILL.md
deleted file mode 100644
index 348244c5272..00000000000
--- a/.agent/skills/cccl-cpp-builds/SKILL.md
+++ /dev/null
@@ -1,53 +0,0 @@
----
-name: cccl-cpp-builds
-description: "Build and test CCCL's C++ libraries (libcudacxx, CUB, Thrust, cudax, C Parallel) — per-project `ci/build_*.sh` and `ci/test_*.sh` full-matrix scripts, architecture conventions, and pointers to the targeted-build alternative. Use when the user wants to build or test a CCCL C++ library across a full host/std/arch matrix, or asks about architecture flag syntax. Trigger phrases: \"build cub\", \"test libcudacxx\", \"build thrust\", \"full matrix build\", \"compile cudax\", \"cuda architectures\". For SINGLE-target fast iteration use `cccl-build-and-test-targets` instead."
----
-
-# cccl-cpp-builds
-
-Per-project full-matrix build + test scripts under `ci/`. Flags: host compiler, C++ standard, GPU architectures.
-
-Full builds: 60+ min build, 30+ min test — never cancel. For single targets, use `cccl-build-and-test-targets`.
-
-## Scripts
-
-```
-./ci/build_<project>.sh  [-cxx <compiler>] [-std <std>] [-arch "<arch-list>"]   # no GPU
-./ci/test_<project>.sh    -cxx <compiler>   -std <std>   -arch "<arch-list>"    # GPU required
-```
-
-| Project           | Build / test scripts        | Stds      |
-|-------------------|-----------------------------|-----------|
-| CUB               | `build_cub`, `test_cub`     | 17, 20    |
-| Thrust            | `build_thrust`, `test_thrust` | 17, 20  |
-| libcudacxx        | `build_libcudacxx`, `test_libcudacxx` | 17, 20 |
-| cudax             | `build_cudax`, `test_cudax` | 20 only   |
-| C Parallel        | `build_cccl_c_parallel`     | 17 only   |
-
-Test scripts build implicitly if the tree is missing. CTest preset form (e.g. `ctest --preset=cub-cpp17`) also
-works.
-
-Compute-sanitizer variants: append `-compute-sanitizer-{memcheck,racecheck,initcheck,synccheck}`. Not all
-projects support all tools — check `--help`.
-
-## Flags
-
-- **`-cxx`** — host compiler (`g++`, `clang++`, `msvc14.39`).
-- **`-std`** — C++ standard (`17` or `20`, subject to project limits above).
-- **`-arch`** — semicolon-separated CUDA architecture list (CMake `CUDA_ARCHITECTURES`):
-
-  | Form             | Generates             |
-  |------------------|-----------------------|
-  | `<XX>`           | PTX + SASS for SM XX  |
-  | `<XX-real>`      | SASS only             |
-  | `<XX-virtual>`   | PTX only              |
-  | `native`         | Detect host GPU       |
-  | `all-major-cccl` | Default for PR builds |
-
-  Examples: `"native"`, `"80"`, `"70;75;80-virtual"`.
-
-## Performance
-
-- `sccache` is enabled in the devcontainer (CCCL-team bucket auth).
-- Limit `-arch` — `"native"` or `"80"` is much faster than `"all-major-cccl"`.
-- Build scripts already parallelize via ninja.
diff --git a/.agent/skills/cccl-cub/SKILL.md b/.agent/skills/cccl-cub/SKILL.md
new file mode 100644
index 00000000000..6a013f89bea
--- /dev/null
+++ b/.agent/skills/cccl-cub/SKILL.md
@@ -0,0 +1,121 @@
+---
+description: |
+  Tour and orientation for the CUB subdirectory — what the library is, how the
+  include tree is organized across block/warp/device/agent layers, test suite
+  layout and naming, the tuning policy mechanism, and how CUB integrates with
+  the CCCL buildsystem.
+  Triggers: "what is cub", "cub overview", "cub primitives", "cub block scan",
+  "cub device reduce", "cub tuning policy".
+---
+
+# CUB
+
+CUB is CCCL's CUDA primitive library. It provides cooperative algorithms at
+three hardware scopes — thread block, warp, and full device — plus internal
+agent-level building blocks used to compose device-wide algorithms.
+
+## Directory layout
+
+| Path                              | Contents                                                             |
+|-----------------------------------|----------------------------------------------------------------------|
+| `cub/cub/block/`                  | Block-level cooperative primitives (`BlockReduce`, `BlockScan`, `BlockSort`, …) |
+| `cub/cub/warp/`                   | Warp-level primitives (`WarpReduce`, `WarpScan`, `WarpExchange`, …) |
+| `cub/cub/device/`                 | Device-wide dispatch facades (`DeviceReduce`, `DeviceScan`, `DeviceSort`, …) |
+| `cub/cub/device/dispatch/`        | Dispatch layer: one `dispatch_*.cuh` per algorithm                   |
+| `cub/cub/device/dispatch/tuning/` | Tuning policy structs, one `tuning_*.cuh` per algorithm              |
+| `cub/cub/agent/`                  | Internal multi-block agents — not part of the public API             |
+| `cub/cub/iterator/`               | Iterator adapters (cache-modified, transform, etc.)                  |
+| `cub/cub/thread/`                 | Single-thread primitives (thread reduce, scan, load/store)           |
+| `cub/cub/grid/`                   | Grid-scope utilities (even-share, mapping, queue)                    |
+| `cub/cub/detail/`                 | Internal helpers — not part of the public API                        |
+| `cub/cmake/`                      | CMake helpers: `CubUtilities.cmake`, `CubHeaderTesting.cmake`, etc.  |
+| `cub/test/`                       | Catch2 and legacy CTest test suite                                   |
+| `cub/benchmarks/`                 | Performance benchmarks (enabled via `CCCL_ENABLE_BENCHMARKS`)        |
+| `cub/examples/`                   | Usage examples under `block/` and `device/` subdirs                  |
+
+## Header conventions
+
+The umbrella include is `<cub/cub.cuh>`. It is not usable from NVRTC — NVRTC
+callers must include specific headers (e.g., `<cub/device/device_reduce.cuh>`).
+Every header follows the same structure:
+
+1. `#pragma once`
+2. `#include <cub/config.cuh>` — pulls in namespace macros and compiler config.
+3. System-header pragmas (`_CCCL_IMPLICIT_SYSTEM_HEADER_*` guards).
+4. Implementation includes, then `CUB_NAMESPACE_BEGIN` / `CUB_NAMESPACE_END`.
+
+Prefer `<cub/device/device_X.cuh>` for device algorithms, and the block headers for
+kernel code. The `agent/` and `detail/` subtrees are internal — treat anything
+not under `block/`, `warp/`, `device/`, `iterator/`, or `thread/` as unstable.
+
+## Tuning policy mechanism
+
+Every device algorithm has a matching `tuning_*.cuh` file under
+`cub/cub/device/dispatch/tuning/`. Each file defines:
+
+- A **policy struct** (e.g., `detail::reduce::reduce_policy`) aggregating
+  per-kernel parameters: `threads_per_block`, `items_per_thread`, block
+  algorithm, load modifier, vector size.
+- A **policy selector** — a `constexpr` function or concept-constrained
+  overload set that picks parameters based on `compute_capability`, accumulator
+  type, and offset type.
+
+The dispatch layer (`dispatch_*.cuh`) queries the selector at compile time and
+instantiates the agent template. Users can inject a custom policy hub by passing
+it as a template argument to the `Dispatch*` type — tests named
+`*_custom_policy_hub.cu` exercise this path.
+
+See `references/tuning-policies.md` for the full struct shapes and how to
+author a custom policy hub.
+
+## Test suite layout
+
+Tests live in `cub/test/`. Two naming conventions:
+
+- `catch2_test_<scope>_<algorithm>[_variant].cu` — Catch2-based; the dominant
+  new style. Discovered automatically by `GLOB_RECURSE`.
+- `test_<name>.cu` — legacy style; uses a bespoke test harness with
+  `CUB_DEBUG_SYNC` enabled.
+
+**`%PARAM%` variant expansion.** Tests that need to cover multiple compile-time
+configurations embed directives of the form:
+
+```
+// %PARAM% TEST_DIM_X dimx 1:7:32:65:128
+```
+
+`cccl_parse_variant_params` reads these at CMake configure time and generates
+one CTest target per combination. Target names take the form
+`cub.test.<scope>.<algorithm>.<label>` (e.g.,
+`cub.test.block.reduce.dimx_32.dimyz_1`).
+
+**`_fail.cu` tests** compile-test for expected diagnostic errors; they use
+`cccl_add_xfail_compile_target_test` and carry `expected-error` regex markers.
+
+**`_api.cu` tests** exercise the public API surface with `--extended-lambda`
+enabled.
+
+Run targeted builds and tests via `cccl-build` and `cccl-test`.
+
+## Buildsystem integration
+
+CUB is built as part of CCCL via the `cub` preset family. When included from the
+CCCL superproject, `CCCL_ENABLE_CUB` controls whether the full `cub/CMakeLists.txt`
+is processed. Key CMake options:
+
+| Option                      | Default | Effect                                |
+|-----------------------------|---------|---------------------------------------|
+| `CUB_ENABLE_HEADER_TESTING` | ON      | Compile-test all public headers       |
+| `CUB_ENABLE_TESTING`        | ON      | Build `cub/test/` targets             |
+| `CUB_ENABLE_EXAMPLES`       | ON      | Build `cub/examples/` targets         |
+| `CUB_ENABLE_TUNING`         | OFF     | Build tuning-exploration benchmarks   |
+
+CTest targets follow `cub.test.<scope>.<algorithm>[.<variant>]`. The
+`cub.compiler_interface` target carries compiler flags; all test targets link it.
+
+## Additional resources
+
+- `references/tuning-policies.md` — policy struct shapes, per-algorithm selector
+  pattern, and how to author a custom policy hub
+- `references/docs.md` — index of CUB documentation (API reference, developer overview, tuning).
+- `references/tools.md` — build and test scripts for CUB.
diff --git a/.agent/skills/cccl-cub/references/docs.md b/.agent/skills/cccl-cub/references/docs.md
new file mode 100644
index 00000000000..efc0954d075
--- /dev/null
+++ b/.agent/skills/cccl-cub/references/docs.md
@@ -0,0 +1,25 @@
+# Documentation index — cccl-cub
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cub/index.rst` | CUB overview: parallel primitives, cooperative algorithms, performance tuning. |
+| `docs/cub/api.rst` | Doxygen-extracted CUB API reference and usage patterns. |
+| `docs/cub/api/index.rst` | Auto-generated API docs for all CUB namespaces and classes (and subdirectories). |
+| `docs/cub/developer_overview.rst` | Internal architecture, kernel composition, and development guide for contributors. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cub/test_overview.rst` | Test infrastructure, `%PARAM%` parameterization, and coverage strategy. |
+| `docs/cub/tuning.rst` | Performance tuning guide: policy selectors, tuning macros, dispatch policies. |
+| `docs/cub/policy_selectors.rst` | Algorithm tuning via policy selectors for different GPU architectures. |
+| `docs/cub/benchmarking.rst` | Benchmarking infrastructure and performance measurement workflows. |
+| `cub/examples/README.md` | Example code for CUB block and device primitives. |
+
+## See also
+
+- `cccl-bench` `references/docs.md` for benchmarking documentation.
+- `cccl_detail-test-params` for `%PARAM%` test parameterization internals.
diff --git a/.agent/skills/cccl-cub/references/tools.md b/.agent/skills/cccl-cub/references/tools.md
new file mode 100644
index 00000000000..0258c0c8c16
--- /dev/null
+++ b/.agent/skills/cccl-cub/references/tools.md
@@ -0,0 +1,9 @@
+# Tool index — cccl-cub
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `ci/build_cub.sh` | Full-matrix CUB build with Launch ID (LID) partitioning for CI artifact splitting. | `cccl-build` → `references/tools.md` |
+| `ci/test_cub.sh` | Full-matrix CUB test: host/std/arch sweep; requires GPU. | `cccl-test` → `references/tools.md` |
+| `ci/util/build_and_test_targets.sh` | Targeted build+test for inner-loop iteration against a single preset. | `cccl-build` → `references/build_and_test_targets_usage.md` |
diff --git a/.agent/skills/cccl-cub/references/tuning-policies.md b/.agent/skills/cccl-cub/references/tuning-policies.md
new file mode 100644
index 00000000000..2aa96b1d662
--- /dev/null
+++ b/.agent/skills/cccl-cub/references/tuning-policies.md
@@ -0,0 +1,85 @@
+# CUB Tuning Policies
+
+## Policy struct shapes
+
+Each device algorithm defines its policy in `cub/cub/device/dispatch/tuning/tuning_<algo>.cuh`.
+A typical policy aggregates sub-structs, one per kernel phase. Example for reduce:
+
+```cpp
+namespace detail::reduce {
+
+struct agent_reduce_policy {
+  int threads_per_block;
+  int items_per_thread;
+  int vec_size;
+  BlockReduceAlgorithm block_algorithm;
+  CacheLoadModifier load_modifier;
+};
+
+struct reduce_policy {
+  agent_reduce_policy reduce;
+  agent_reduce_policy single_tile;
+};
+
+} // namespace detail::reduce
+```
+
+The legacy style uses nested `struct`s with static constants (`ReducePolicy`,
+`SingleTilePolicy`, etc.) accessed via `CUB_DEFINE_SUB_POLICY_GETTER`. New code uses
+the aggregate style above. The dispatch layer handles both via `ReducePolicyWrapper`.
+
+## Policy selector pattern
+
+The selector is a `constexpr` function that maps runtime-detected
+`compute_capability` + type-level traits to a `reduce_policy`:
+
+```cpp
+template <class AccumT, class OffsetT>
+_CCCL_HOST_DEVICE constexpr reduce_policy get_policy(
+  compute_capability cc, ...) noexcept;
+```
+
+Type-level classification helpers (`classify_accum_size<T>()`,
+`classify_offset_size<T>()`, `op_type`) turn the template arguments into
+discriminants so the selector stays `constexpr`.
+
+## Authoring a custom policy hub
+
+A custom policy hub is a struct with the same nested policy types the dispatch
+layer queries. Pass it as the `PolicyHub` template argument to the dispatch type:
+
+```cpp
+struct MyReduceHub {
+  struct MaxPolicy {
+    struct ReducePolicy : cub::AgentReducePolicy<256, 16, int, 4,
+      cub::BLOCK_REDUCE_WARP_REDUCTIONS, cub::LOAD_LDG> {};
+    struct SingleTilePolicy : ReducePolicy {};
+  };
+};
+
+// Invoke with a custom hub, passed as the PolicyHub template argument:
+cub::DispatchReduce</* ..., */ MyReduceHub>::Dispatch(...);
+```
+
+Tests named `catch2_test_device_<algo>_custom_policy_hub.cu` demonstrate the
+full pattern for each algorithm. Read those before writing a new hub.
+
+## Where policies live per algorithm
+
+| Algorithm            | Tuning file                                                                              |
+|----------------------|------------------------------------------------------------------------------------------|
+| Reduce               | `tuning/tuning_reduce.cuh` + `tuning_reduce_deterministic.cuh` / `_nondeterministic.cuh` |
+| Scan                 | `tuning/tuning_scan.cuh`, `tuning_scan_by_key.cuh`                                       |
+| Sort (radix)         | `tuning/tuning_radix_sort.cuh`                                                           |
+| Sort (merge)         | `tuning/tuning_merge_sort.cuh`                                                           |
+| Histogram            | `tuning/tuning_histogram.cuh`                                                            |
+| Select / Partition   | `tuning/tuning_select_if.cuh`, `tuning_three_way_partition.cuh`                          |
+| Run-length encode    | `tuning/tuning_rle_encode.cuh`, `tuning_rle_non_trivial_runs.cuh`                        |
+| Others               | `tuning/tuning_<algo>.cuh` — pattern is uniform                                          |
+
+## Compute-capability dispatch
+
+Selectors branch on `compute_capability` values (e.g., `sm_80`, `sm_90`). The
+`cuda/__device/compute_capability.h` header provides the comparison operators.
+Selectors are evaluated at device-function instantiation time, so the compiler
+sees the final constant-folded policy, with no runtime branching.
diff --git a/.agent/skills/cccl-cudax/SKILL.md b/.agent/skills/cccl-cudax/SKILL.md
new file mode 100644
index 00000000000..71eecc74012
--- /dev/null
+++ b/.agent/skills/cccl-cudax/SKILL.md
@@ -0,0 +1,94 @@
+---
+description: |
+  Tour and orientation for the cudax subdirectory — what CUDA Experimental is,
+  the no-stability-guarantee contract, include tree layout, major feature areas
+  (streams, containers, memory resources, execution, STF, places, graph,
+  copy), test suite structure, and how features graduate to stable CCCL libraries.
+  Triggers: "what is cudax", "cudax overview", "cudax experimental",
+  "cuda::experimental", "cudax features".
+---
+
+# cudax
+
+cudax (`cuda/experimental/`) is CCCL's staging ground for features under active
+design. Everything in the `cuda::experimental::` namespace carries zero stability
+guarantees — API and ABI can change or disappear without notice, at any cadence.
+It is not shipped with the CUDA Toolkit; it is available only from the CCCL
+GitHub repository. It requires C++17 or newer and NVCC 12.3+, with GCC 7+,
+Clang 9+, or MSVC 2019+ as the host compiler.
+
+## Stability contract
+
+No stability guarantees whatsoever. Features live here while their design
+solidifies and the community provides feedback. Once a feature is considered
+ready, it graduates to a stable CCCL library (`libcudacxx`, CUB, or Thrust) with
+a stable API. There is no documented timeline or graduation checklist; graduation
+happens on maintainer judgment.
+
+The CMake target is `cudax::cudax`, exposed only when `CCCL_ENABLE_UNSTABLE` is
+set before `find_package` or `add_subdirectory`.
+
+## Include tree
+
+Public headers live under `cudax/include/cuda/experimental/`. Each feature area
+has a top-level `.cuh` entry point and a `__<area>/` detail directory.
+
+| Entry header          | Feature area                                                         |
+|-----------------------|----------------------------------------------------------------------|
+| `container.cuh`       | `uninitialized_buffer`, `graph_buffer`                               |
+| `stream.cuh`          | `stream`, `stream_ref`                                               |
+| `memory_resource.cuh` | Stream-ordered memory resources, `graph_resource`                    |
+| `execution.cuh`       | stdexec-based async execution model                                  |
+| `launch.cuh`          | Typed kernel launch parameters                                       |
+| `graph.cuh`           | CUDA graph construction and management                               |
+| `places.cuh`          | Execution/data affinity across multi-device systems                  |
+| `stf.cuh`             | Sequential Task Flow (STF) programming model                         |
+| `copy.cuh`            | Typed async copy                                                     |
+| `copy_bytes.cuh`      | Byte-wise mdspan host↔device copy                                    |
+| `green_context.cuh`   | SM-partitioned green contexts (CUDA 12.4+)                           |
+| `group.cuh`           | Cooperative group abstractions with mappings                         |
+| `kernel.cuh`          | Kernel attribute introspection (`kernel_ref`)                        |
+| `library.cuh`         | Library context handle (`library_ref`)                               |
+| `cufile.cuh`          | cuFile integration (CUDA 12.9+, Linux only)                          |
+
+## Feature areas
+
+| Area               | Key types / entry point                              | Notes                                                                                                            |
+|--------------------|------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
+| Containers         | `uninitialized_buffer<T, Props...>`                  | Owning device storage, memory location encoded in properties                                                    |
+| Streams            | `stream`, `stream_ref`                               | Owning and non-owning `cudaStream_t` wrappers with RAII                                                         |
+| Memory resources   | stream-ordered `async_resource`, `graph_resource`   | Compatible with `libcudacxx` `cuda::mr::` framework                                                             |
+| Execution          | P2300 (`std::execution`) senders/schedulers          | `stream_context`, `sync_wait`, `bulk`, `when_all`; needs `-allow-unsupported-compiler`                          |
+| Launch             | `launch<Config>(kernel, args...)`                    | Typed launch with compile-time grid/block encoding                                                              |
+| Graph              | graph capture + node ops, `graph_buffer`            | Graph-lifetime allocations                                                                                       |
+| Places             | `exec_place`, `data_place`                           | Execution/data affinity across devices, green contexts, stream pools, multi-device grids; standalone (no STF required) |
+| STF                | Sequential Task Flow                                | Task-graph model with auto dependency tracking; large subproject, `cudax_ENABLE_CUDASTF`                        |
+| Copy / copy_bytes  | `copy`, typed mdspan transfer                        | Byte-wise and typed async copies, relaxed and strict ordering                                                   |
+| Green contexts     | `green_context_helper`                               | SM-partitioned sub-device contexts; CUDA 12.4+                                                                  |
+| Group              | cooperative algorithms + mappings                    | Group synchronizers, segmented algorithms                                                                        |
+| cuFile             | GDS integration                                      | Linux only; requires CUDA 12.9+                                                                                 |
+
+## Header conventions
+
+Mirrors `libcudacxx`. Public entry headers are `include/cuda/experimental/<name>.cuh`.
+Implementation detail headers live in `__<name>/` subdirectories and use the same
+`_CCCL_IMPLICIT_SYSTEM_HEADER_*` pragma guards. Include the top-level entry header
+only; never include detail headers directly.
+
+## Test suite
+
+Tests use Catch2 under `cudax/test/`. CTest targets are `cudax.test.<area>`,
+mirroring the include tree (e.g. `cudax.test.execution`, `cudax.test.containers`).
+STF and Places have separate CMake subdirectories (`test/stf/`, `test/places/`)
+with distinct build requirements. cuFile tests are conditional on
+`cudax_ENABLE_CUFILE` and require CUDA 12.9+ on Linux.
+
+## Cross-references
+
+- Build and run cudax targets → `cccl-build`, `cccl-test`
+- Stable stdlib layer that cudax features graduate into → `cccl-libcudacxx`
+
+## Additional resources
+
+- `references/docs.md` — index of cudax documentation (STF, memory resources, API reference).
+- `references/tools.md` — build and test scripts for cudax.
diff --git a/.agent/skills/cccl-cudax/references/docs.md b/.agent/skills/cccl-cudax/references/docs.md
new file mode 100644
index 00000000000..aa4d6920fea
--- /dev/null
+++ b/.agent/skills/cccl-cudax/references/docs.md
@@ -0,0 +1,21 @@
+# Documentation index — cccl-cudax
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cudax/index.rst` | Experimental features overview: STF, graph, memory resources, places. |
+| `docs/cudax/memory_resource.rst` | Async and stream-ordered memory resource abstraction (`cuda::mr`). |
+| `docs/cudax/stf.rst` | Sequential Task Flow (STF) programming model for CUDA kernels. |
+| `docs/cudax/stf/index.rst` | Auto-generated STF API documentation (and subdirectories). |
+| `docs/cudax/api/index.rst` | Full API reference for all cudax components (and subdirectories). |
+| `cudax/README.md` | Feature overview, compiler requirements (C++20, CUDA 12+), installation instructions. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cudax/container.rst` | Experimental container abstractions. |
+| `docs/cudax/graph.rst` | Graph construction and execution features. |
+| `docs/cudax/places.rst` | Execution and data affinity abstractions across devices. |
+| `examples/cudax/README.md` | Experimental features demonstration examples. |
diff --git a/.agent/skills/cccl-cudax/references/tools.md b/.agent/skills/cccl-cudax/references/tools.md
new file mode 100644
index 00000000000..054f804a145
--- /dev/null
+++ b/.agent/skills/cccl-cudax/references/tools.md
@@ -0,0 +1,9 @@
+# Tool index — cccl-cudax
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `ci/build_cudax.sh` | Full-matrix cudax build (C++20 only). | `cccl-build` → `references/tools.md` |
+| `ci/test_cudax.sh` | Full-matrix cudax test; requires GPU. | `cccl-test` → `references/tools.md` |
+| `ci/util/build_and_test_targets.sh` | Targeted build+test for inner-loop iteration. | `cccl-build` → `references/build_and_test_targets_usage.md` |
diff --git a/.agent/skills/cccl-devcontainer/SKILL.md b/.agent/skills/cccl-devcontainer/SKILL.md
new file mode 100644
index 00000000000..d386f71720b
--- /dev/null
+++ b/.agent/skills/cccl-devcontainer/SKILL.md
@@ -0,0 +1,52 @@
+---
+description: |
+  CCCL's `.devcontainer/launch.sh` — launch a Docker container with a chosen CUDA toolkit
+  and host compiler, mount the repo, and run a shell or script. Linux-only. Covers flag
+  conventions, the already-in-container check, and the available CTK × host-compiler matrix.
+  Triggers: "run in devcontainer", "launch the container", "build with cuda 13.2", "open a shell with gcc 14", "start a devcontainer".
+---
+
+# cccl-devcontainer
+
+`.devcontainer/launch.sh` boots a Docker container preconfigured with a chosen CUDA toolkit
+and host compiler, mounts the repo, and either drops into a shell or runs a script.
+**Linux-only** — Linux host, Linux container. Windows / MSVC builds run outside the devcontainer.
+
+## Launch flags
+
+| Flag                     | Purpose                                  |
+|--------------------------|------------------------------------------|
+| `-d`, `--docker`         | Run without VSCode (required for agents) |
+| `--cuda <version>`       | CUDA toolkit (e.g. `13.2`, `12.9`)       |
+| `--cuda-ext`             | Image with extended CTK libraries        |
+| `--host <compiler>`      | Host compiler (`gcc14`, `clang17`)       |
+| `--gpus <request>`       | GPU passthrough (`all` for everything)   |
+| `-e`, `--env KEY=VAL`    | Inject env var                           |
+| `-v`, `--volume SRC:DST` | Mount additional path                    |
+| `-- <script> [args]`     | Run script inside container after setup  |
+
+```
+.devcontainer/launch.sh -d --cuda 13.2 --host gcc14
+.devcontainer/launch.sh -d --cuda 12.9 --host gcc13 -- ./ci/build_cub.sh -cxx g++ -std 17 -arch native
+.devcontainer/launch.sh -d --gpus all -- ./ci/util/build_and_test_targets.sh --preset cub-cpp20 --build-targets "cub.cpp20.test.iterator"
+```
+
+For targeted builds inside a container, route to `cccl-build`; for tests, `cccl-test`.
+
+## Already inside a container?
+
+`CCCL_BUILD_INFIX` is set inside the container. Check before launching:
+
+```
+echo "$CCCL_BUILD_INFIX"
+```
+
+Non-empty → already inside; run the command directly. Nested launches don't work.
+First launch pulls the image; subsequent launches use cache.
+
+## Additional resources
+
+- `references/regenerate.md` — when and how to regenerate devcontainer subdirs from `ci/matrix.yaml`.
+- `references/docs.md` — index of devcontainer documentation.
+- `references/tools.md` — devcontainer scripts with purpose and ownership.
+- `references/launch_usage.md` — `.devcontainer/launch.sh` interface and examples.
diff --git a/.agent/skills/cccl-devcontainer/references/docs.md b/.agent/skills/cccl-devcontainer/references/docs.md
new file mode 100644
index 00000000000..14227ab7fde
--- /dev/null
+++ b/.agent/skills/cccl-devcontainer/references/docs.md
@@ -0,0 +1,12 @@
+# Documentation index — cccl-devcontainer
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `.devcontainer/README.md` | Docker setup for Linux, WSL, and manual Docker usage; container naming convention; common launch patterns. |
+
+## See also
+
+- `cccl_detail-devcontainer-matrix` `references/` for devcontainer matrix generation from `ci/matrix.yaml`.
+- `cccl-build` `references/build_and_test_targets_usage.md` for the build/test command to run inside the container.
diff --git a/.agent/skills/cccl-devcontainer/references/launch_usage.md b/.agent/skills/cccl-devcontainer/references/launch_usage.md
new file mode 100644
index 00000000000..053fc2f8368
--- /dev/null
+++ b/.agent/skills/cccl-devcontainer/references/launch_usage.md
@@ -0,0 +1,83 @@
+# `.devcontainer/launch.sh` usage
+
+Launches a CCCL development container via Docker or VSCode. Selects the devcontainer config
+matching the requested CTK version and host compiler from the pre-generated
+`.devcontainer/cuda{X.Y}-{compiler}/` subdirectory, then starts the container with the repo
+root bind-mounted at `/home/coder/cccl`.
+
+## Location
+
+`.devcontainer/launch.sh`. Run from the repo root. Requires Docker installed and running.
+Must use GNU `getopt` (standard on Linux; install via `brew install gnu-getopt` on macOS if needed).
+
+## Interface
+
+```
+Usage: .devcontainer/launch.sh [-c|--cuda <CUDA version>] [-H|--host <Host compiler>] [-d|--docker]
+Launch a development container. If no CUDA version or Host compiler are specified,
+the top-level devcontainer in .devcontainer/devcontainer.json will be used.
+
+Options:
+  -c, --cuda               Specify the CUDA version. E.g., 12.2
+  --cuda-ext               Use a docker image with extended CTK libraries.
+  -H, --host               Specify the host compiler. E.g., gcc12
+  -d, --docker             Launch the development environment in Docker directly without using VSCode.
+  --gpus gpu-request       GPU devices to add to the container ('all' to pass all GPUs).
+  -e, --env list           Set additional container environment variables.
+  -v, --volume list        Bind mount a volume.
+  -h, --help               Display this help message and exit.
+```
+
+## Options
+
+| Flag | Required? | Description |
+|------|-----------|-------------|
+| `-c` / `--cuda` | No | CTK version string, e.g. `12.2`, `13.2`. Must match an entry in the devcontainers section of `ci/matrix.yaml`. |
+| `--cuda-ext` | No | Use the extended CTK image variant (includes extra libraries like cuSPARSE). |
+| `-H` / `--host` | No | Host compiler name, e.g. `gcc12`, `gcc14`, `clang18`. Must match an entry in `ci/matrix.yaml`. |
+| `-d` / `--docker` | No | Launch directly via `docker run` instead of opening VSCode. Required for non-interactive / scripted use. |
+| `--gpus` | No | GPU device request passed to `docker run --gpus`. Use `all` to expose all GPUs. |
+| `-e` / `--env` | No | Extra environment variables for the container (repeatable; `VAR=value` format). |
+| `-v` / `--volume` | No | Extra bind mounts (repeatable; `host:container` format). |
+
+If neither `--cuda` nor `--host` is given, falls back to `.devcontainer/devcontainer.json` (the
+default, non-versioned devcontainer).
+
+Any arguments after `--` are passed as the command to run inside the container instead of the
+default shell.
+
+## Examples
+
+```bash
+# Open default devcontainer in VSCode
+.devcontainer/launch.sh
+
+# Run a build inside Docker with CTK 13.2 + GCC 14
+.devcontainer/launch.sh -d --cuda 13.2 --host gcc14 -- \
+  ./ci/util/build_and_test_targets.sh \
+    --preset cub-cpp20 \
+    --build-targets "cub.cpp20.test.iterator"
+
+# Run a test that requires a GPU
+.devcontainer/launch.sh -d --cuda 13.2 --host gcc14 --gpus all -- \
+  ./ci/util/build_and_test_targets.sh \
+    --preset cub-cpp20 \
+    --ctest-targets "cub.cpp20.test.iterator"
+
+# Pass extra env vars (e.g., enable memory monitor)
+.devcontainer/launch.sh -d --cuda 12.9 --host gcc13 -e MEMMON=1 -- \
+  ./ci/build_cub.sh -std 17 -arch native
+```
+
+## Notes / gotchas
+
+- `CCCL_BUILD_INFIX` is set inside the container to isolate build artifacts by devcontainer
+  variant. Do not launch nested containers — check `echo "$CCCL_BUILD_INFIX"` first; non-empty
+  means you are already inside.
+- The first launch pulls the Docker image, which can be slow; subsequent launches use the cache.
+- `--gpus all` is required for any test that runs device kernels; build-only runs do not need it.
+- The `--` separator is required when passing a command; arguments before `--` are launch options,
+  arguments after `--` are passed verbatim to the container.
+- Devcontainer configs are generated by `.devcontainer/make_devcontainers.sh` from `ci/matrix.yaml`.
+  If a requested `--cuda`/`--host` combination has no matching config, the launch will fail with a
+  "no such file" error — check that the combination is in the matrix.
diff --git a/.agent/skills/cccl-devcontainer/references/regenerate.md b/.agent/skills/cccl-devcontainer/references/regenerate.md
new file mode 100644
index 00000000000..b5cab223c59
--- /dev/null
+++ b/.agent/skills/cccl-devcontainer/references/regenerate.md
@@ -0,0 +1,39 @@
+# Regenerating devcontainer subdirs
+
+## When to regenerate
+
+The per-combination subdirs under `.devcontainer/` (e.g. `.devcontainer/cuda13.2-gcc14/`) and
+their `devcontainer.json` files are **generated**. Direct edits to them are overwritten on the
+next regeneration run. Regenerate when:
+
+- Adding or removing a CUDA × host-compiler combination from `ci/matrix.yaml`.
+- Changing the base `.devcontainer/devcontainer.json` template.
+- Pruning stale subdirs that no longer match the matrix.
+
+## How to regenerate
+
+1. Edit `ci/matrix.yaml` — the `dc` entries (and `dc_ext` for extended-CTK images) control which
+   CUDA × host-compiler combinations exist.
+2. If the template itself needs changing, edit `.devcontainer/devcontainer.json` (the base template,
+   not a per-combination subdir).
+3. From the repo root:
+
+```
+.devcontainer/make_devcontainers.sh --clean
+```
+
+`--clean` prunes subdirs for combinations that no longer appear in the matrix.
+
+## What gets rewritten
+
+- `.devcontainer/cuda<version>-<host>/devcontainer.json` — one file per matrix combination.
+- Stale subdirs are removed when `--clean` is passed.
+- The base `.devcontainer/devcontainer.json` template is **not** modified by the script.
+
+## Validating after regeneration
+
+Push the regenerated files. CI's "Validate Devcontainer" jobs run automatically and confirm
+each generated `devcontainer.json` is well-formed.
+
+`[skip-vdc]` skips those jobs. Do not use it on PRs that modify `.devcontainer/`, `ci/`, or
+`.github/`.
diff --git a/.agent/skills/cccl-devcontainer/references/tools.md b/.agent/skills/cccl-devcontainer/references/tools.md
new file mode 100644
index 00000000000..6017ec36374
--- /dev/null
+++ b/.agent/skills/cccl-devcontainer/references/tools.md
@@ -0,0 +1,16 @@
+# Tool index — cccl-devcontainer
+
+## Owned (canonical reference lives here)
+
+| Tool | Purpose | Detail |
+|------|---------|--------|
+| `.devcontainer/launch.sh` | Launch a devcontainer (Docker or VSCode) for a given CTK version + host compiler combination. The primary entry point for all devcontainer-based builds and tests. | `references/launch_usage.md` |
+| `.devcontainer/docker-entrypoint.sh` | Docker container startup hook: sets environment, sources sccache config, runs requested command. Invoked by Docker, not directly. | invoked by Docker; not user-invoked |
+| `.devcontainer/cccl-entrypoint.sh` | CCCL-specific container init: sets `CCCL_BUILD_INFIX`, sources build environment. Sourced inside the container. | sourced by container; not user-invoked |
+| `.devcontainer/verify_devcontainer.sh` | Verifies a named devcontainer config is well-formed and builds successfully. Used by the `verify-devcontainers` CI workflow. | CI-internal |
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `.devcontainer/make_devcontainers.sh` | Generates 60+ devcontainer subdirectory configs from `ci/matrix.yaml`. | `cccl_detail-devcontainer-matrix` → `references/tools.md` |
diff --git a/.agent/skills/cccl-devcontainers/SKILL.md b/.agent/skills/cccl-devcontainers/SKILL.md
deleted file mode 100644
index 4920e18417c..00000000000
--- a/.agent/skills/cccl-devcontainers/SKILL.md
+++ /dev/null
@@ -1,58 +0,0 @@
----
-name: cccl-devcontainers
-description: "Use CCCL's `.devcontainer/launch.sh` to run one-off bash sessions, builds, or tests inside a CCCL-configured container with a chosen CUDA toolkit and host compiler. Covers the `-d` / `--cuda` / `--host` / `--gpus` / `--env` / `--volume` argument conventions and the `CCCL_BUILD_INFIX` already-in-container check. Use when the user wants to build/test in a clean, reproducible environment, run a quick experiment with a specific toolchain, or escape from host environment problems. Trigger phrases: \"run in devcontainer\", \"launch the container\", \"build with cuda 13.2\", \"open a shell with gcc 14\"."
----
-
-# cccl-devcontainers
-
-`.devcontainer/launch.sh` boots a Docker container preconfigured with a chosen CUDA toolkit and host compiler,
-mounts the repo, and either drops into a shell or runs a script. **Linux-only** — Linux host, Linux container.
-Windows / MSVC builds run outside the devcontainer.
-
-## Flags
-
-| Flag                     | Purpose                                  |
-|--------------------------|------------------------------------------|
-| `-d`, `--docker`         | Run without VSCode (required for agents) |
-| `--cuda <version>`       | CUDA toolkit (e.g. `13.2`, `12.9`)       |
-| `--cuda-ext`             | Image with extended CTK libraries        |
-| `--host <compiler>`      | Host compiler (`gcc14`, `clang17`)       |
-| `--gpus <request>`       | GPU passthrough (`all` for everything)   |
-| `-e`, `--env KEY=VAL`    | Inject env var                           |
-| `-v`, `--volume SRC:DST` | Mount additional path                    |
-| `-- <script> [args]`     | Run script inside container after setup  |
-
-Examples:
-
-```
-.devcontainer/launch.sh -d --cuda 13.2 --host gcc14
-.devcontainer/launch.sh -d --cuda 12.9 --host gcc13 -- ./ci/build_cub.sh -cxx g++ -std 17 -arch native
-.devcontainer/launch.sh -d --gpus all -- ./ci/util/build_and_test_targets.sh --preset cub-cpp20 --build-targets "cub.cpp20.test.iterator"
-```
-
-## Already inside a container?
-
-`CCCL_BUILD_INFIX` is set inside the container. Before launching:
-
-```
-echo "$CCCL_BUILD_INFIX"
-```
-
-Non-empty → already inside; run the command directly. Nested launches don't work.
-
-First launch pulls the image; subsequent launches use cache.
-
-## Updating devcontainers
-
-Per-combination subdirs (`.devcontainer/cuda<version>-<host>/`) and their `devcontainer.json` files are
-**generated** — direct edits get overwritten. To change the set of available containers:
-
-1. Edit `ci/matrix.yaml` — the `dc` (and `dc_ext` for extended-CTK) entries control which CUDA × host-compiler
-   combinations exist.
-2. If the template itself needs changing, edit the base `.devcontainer/devcontainer.json`.
-3. Run `.devcontainer/make_devcontainers.sh --clean` from the repo root to regenerate per-combination subdirs and
-   prune stale ones.
-4. Push; CI's "Validate Devcontainer" jobs run.
-
-`[skip-vdc]` blocks Validate Devcontainer jobs. Don't use it on PRs that modify `.devcontainer/`, `ci/`, or
-`.github/`.
diff --git a/.agent/skills/cccl-docs/SKILL.md b/.agent/skills/cccl-docs/SKILL.md
new file mode 100644
index 00000000000..22e71681b75
--- /dev/null
+++ b/.agent/skills/cccl-docs/SKILL.md
@@ -0,0 +1,94 @@
+---
+description: |
+  CCCL's documentation system — Sphinx pages + Doxygen API extraction + per-library subtrees.
+  Covers the docs/ layout, build script, deploy workflow, and Breathe integration.
+  Triggers: "how do I build the docs", "how do docs deploy", "where are the docs sources",
+  "sphinx docs", "doxygen api docs".
+---
+
+# Documentation
+
+CCCL's documentation is a Sphinx site with Doxygen-generated API references bridged via Breathe. The source lives entirely under `docs/`. There is no CMake-based doc target — the build entry point is a single shell script.
+
+## Layout
+
+```
+docs/
+├── conf.py                 ← Sphinx configuration (extensions, Breathe projects, theme)
+├── requirements.txt        ← Python deps (sphinx, breathe, myst-parser, nvidia-sphinx-theme, …)
+├── gen_docs.bash           ← build entry point
+├── scrape_docs.bash        ← post-build page-list scraper
+├── index.rst               ← site root (cpp, python, maintainers)
+├── cpp.rst                 ← C++ library landing page
+├── cccl/                   ← cross-library guides (contributing, migration, macros, dev/)
+├── libcudacxx/             ← libcudacxx docs + Doxyfile
+├── cub/                    ← CUB docs + Doxyfile
+├── thrust/                 ← Thrust docs + Doxyfile
+├── cudax/                  ← cudax docs + Doxyfile
+├── python/                 ← Python package docs (compute, coop)
+├── maintainers/            ← branching, backport, coderabbit guides
+├── _ext/auto_api_generator.py  ← custom Sphinx extension (generates API pages from Doxygen XML)
+└── _build/                 ← generated output (gitignored)
+```
+
+Each C++ library (`cub`, `thrust`, `libcudacxx`, `cudax`) has its own `Doxyfile` that extracts XML into `docs/_build/doxygen/<lib>/xml/`. Breathe reads that XML; `auto_api_generator.py` drives page generation.
+
+## Building locally
+
+Run from the repo root (Linux only; needs `cmake`, `ninja`, `flex`, `bison`, `python3-venv`):
+
+```
+./docs/gen_docs.bash
+```
+
+Pass `--allow-dep-install` to auto-install missing system packages via `apt`. The script:
+
+1. Builds Doxygen 1.9.6 from source on first run (cached under `docs/_build/doxygen-build/`).
+2. Creates a Python venv at `docs/env/` and installs `docs/requirements.txt`.
+3. Runs each library's Doxygen build in parallel.
+4. Runs `sphinx.cmd.build -b html`.
+5. Reorganises output into `docs/_build/html/<VERSION>/` and writes `nv-versions.json`.
+
+To build inside a container, route to `cccl-devcontainer` first. The custom Doxygen build takes several minutes on first run; subsequent runs reuse the cached binary.
+
+Clean the build output:
+
+```
+./docs/gen_docs.bash clean        # removes _build/html
+./docs/gen_docs.bash clean --all  # also removes cached Doxygen source + binary
+```
+
+## Deploy workflow
+
+`.github/workflows/docs-deploy.yml` triggers on pushes to `main` and on `workflow_dispatch`.
+
+- `main` → publishes under `docs/unstable/` with `is_latest=true`.
+- `branch/X.Y.x` → publishes under `docs/X.Y/`.
+- `destination_override` input overrides the target directory (useful for fork testing).
+
+The workflow delegates to `.github/actions/docs-build/action.yml`, which calls `gen_docs.bash --allow-dep-install` and copies `docs/_build/html/*` to `_site/`. The deploy step (`peaceiris/actions-gh-pages`) pushes to the `gh-pages` branch with `keep_files: true`.
+
+Published site: `https://nvidia.github.io/cccl/`.
+
+## Sphinx + Doxygen integration
+
+Breathe is the bridge. `conf.py` declares four Breathe projects (`cub`, `thrust`, `libcudacxx`, `cudax`), each pointing at its Doxygen XML directory. The custom extension `_ext/auto_api_generator.py` walks the XML and generates individual `.rst` pages per API symbol, skipping a small set of symbols that Breathe cannot parse (`_BREATHE_SKIP_SYMBOLS`).
+
+Exhale is listed in `requirements.txt` but disabled in `conf.py` (build timeouts). All API page generation goes through `auto_api_generator`.
+
+Theme: `nvidia-sphinx-theme`. Version switcher JSON: `nv-versions.json` at the site root.
+
+## Per-library API doc structure
+
+| Library    | Doxyfile                      | Input headers                                            |
+|------------|-------------------------------|----------------------------------------------------------|
+| CUB        | `docs/cub/Doxyfile`           | `cub/cub/**` (excludes `detail/`, `dispatch/`, `kernels/`) |
+| Thrust     | `docs/thrust/Doxyfile`        | Thrust public headers                                   |
+| libcudacxx | `docs/libcudacxx/Doxyfile`    | libcudacxx public headers                               |
+| cudax      | `docs/cudax/Doxyfile`         | cudax public headers                                    |
+
+Python API docs use Sphinx `autodoc` + `autosummary` (no Doxygen). Source is `python/cuda_cccl/`.
+
+## Additional resources
+
+- `references/doxygen-breathe-gotchas.md` — known parse failures, `_BREATHE_SKIP_SYMBOLS`, suppressed warnings.
diff --git a/.agent/skills/cccl-docs/references/doxygen-breathe-gotchas.md b/.agent/skills/cccl-docs/references/doxygen-breathe-gotchas.md
new file mode 100644
index 00000000000..4419c4b8e60
--- /dev/null
+++ b/.agent/skills/cccl-docs/references/doxygen-breathe-gotchas.md
@@ -0,0 +1,36 @@
+# Doxygen + Breathe gotchas
+
+## Suppressed warnings
+
+`conf.py` suppresses two warning categories that cannot be fixed at the source level:
+
+- `cpp.duplicate_declaration` — Breathe walks each Doxygen XML file independently. Symbols appearing in both a namespace XML and a class/group XML are emitted twice. No workaround on the CCCL side.
+- `docutils` — Breathe renders complex C++ default arguments and SFINAE expressions as RST that docutils cannot parse (mismatched inline literals, unexpected braces). The C++ is valid; the RST Breathe emits is not.
+
+## `_BREATHE_SKIP_SYMBOLS`
+
+Defined in `docs/_ext/auto_api_generator.py`. API pages are not generated for these symbols to prevent unfixable build failures under `-W`:
+
+| Symbol                                    | Reason                                                                                                                      |
+|-------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
+| `get_executor_func_t`                     | Function-pointer typedef; Breathe renders it as `pos4(*)(pos4, dim4, dim4)`, which the Sphinx C++ parser rejects with "Expected end of definition". |
+| `cuda::experimental::stf::get_executor_func_t` | Same (qualified form).                                                                                                      |
+| `partition_fn_t`                          | Same issue as `get_executor_func_t`.                                                                                         |
+| `cuda::experimental::stf::partition_fn_t` | Same (qualified form).                                                                                                      |
+| `property_with_value`                     | Variable template using `_CCCL_REQUIRES_EXPR` with `typename(...)` — Sphinx cannot parse the requires-expression expansion. |
+| `has_property`                            | Variable template with `_CCCL_REQUIRES_EXPR` containing `const` — same parse failure.                                       |
+| `cuda::experimental::group`               | Variable with a complex type expression that the Sphinx C++ parser cannot handle.                                           |
+
+To add a new skip: append to `_BREATHE_SKIP_SYMBOLS` in `docs/_ext/auto_api_generator.py`. Both the qualified and unqualified forms may need to be listed (the generator uses different naming conventions per project).
+
+## Exhale disabled
+
+`exhale` is in `requirements.txt` but commented out in `conf.py`. It was disabled due to build timeouts. All API page generation goes through `auto_api_generator` instead.
+
+## Doxygen version pinning
+
+`gen_docs.bash` builds Doxygen 1.9.6 from source on first run and caches it under `docs/_build/doxygen-build/`. This pins the version for consistent output. If the cached binary is present, it is used unconditionally. To rebuild: `./docs/gen_docs.bash clean --all`.
+
+## CUDA macro attributes
+
+`conf.py` registers CCCL-specific macros as `cpp_id_attributes` so Breathe/Sphinx does not reject declarations annotated with `__device__`, `_CCCL_HOST_DEVICE`, `_CCCL_API`, and similar. If a new CCCL macro causes parse failures, add it to `cpp_id_attributes` or `cpp_paren_attributes` in `conf.py`.
diff --git a/.agent/skills/cccl-infra/SKILL.md b/.agent/skills/cccl-infra/SKILL.md
new file mode 100644
index 00000000000..ce44c97b4b4
--- /dev/null
+++ b/.agent/skills/cccl-infra/SKILL.md
@@ -0,0 +1,65 @@
+---
+description: |
+  Cross-functional infrastructure maintenance — tasks that fan out across CI, devcontainers,
+  CMake, and release tooling simultaneously. Covers CTK bumps, host-compiler additions,
+  release cuts, and project add/remove.
+  Triggers: "bump CTK", "add compiler version", "cut a release", "add a new project",
+  "infra overview", "what touches the matrix".
+---
+
+Entry point for cross-cutting infrastructure work. Individual subsystem questions go to
+the per-subsystem skills listed below; this skill handles tasks that touch more than one
+at once and provides the ordered playbooks for each.
+
+## Subsystem map
+
+| Subsystem                      | Skill                              | Canonical file(s)                                                                     |
+|--------------------------------|------------------------------------|---------------------------------------------------------------------------------------|
+| CI matrix / job dispatch       | `cccl-ci`                          | `ci/matrix.yaml`, `.github/actions/workflow-build/build-workflow.py`                   |
+| Devcontainer generation        | `cccl-devcontainer`                | `.devcontainer/make_devcontainers.sh`, `devcontainer.json`                            |
+| CMake presets / options        | `cccl-cmake`                       | `CMakePresets.json`                                                                   |
+| Pre-commit linters             | `cccl-precommit`                   | `.pre-commit-config.yaml`                                                             |
+| Release workflows              | `cccl_detail-release`              | `ci/update_version.sh`, `cccl-version.json`, `.github/workflows/release-*.yml`         |
+| GitHub workflow templates      | `cccl_detail-github`               | `.github/workflows/`, `.github/actions/`                                              |
+| Examples / packaging           | `cccl_detail-examples`             | `examples/`, `test/cmake/`, `ci/test/`                                                |
+| Devcontainer matrix deep-dive  | `cccl_detail-devcontainer-matrix`  | `.devcontainer/make_devcontainers.sh`, `ctk_versions` / `host_compilers` in `ci/matrix.yaml` |
+
+## Task fanout
+
+| Task              | Files touched                                                                                                                                                       | Playbook                       |
+|-------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------|
+| Add a CTK version | `ci/matrix.yaml` (`ctk_versions`, `devcontainer_version`, workflow rows), `.devcontainer/` (regen), `.github/workflows/verify-devcontainers.yml`                  | `references/ctk-bump.md`       |
+| Add/remove a host compiler | `ci/matrix.yaml` (`host_compilers`, `cuda99_*_version`, workflow rows), `.devcontainer/` (regen)                                                           | `references/compiler-bump.md`  |
+| Cut a release     | `cccl-version.json`, `ci/update_version.sh` targets, `lib/cmake/*/`, per-library `version.h` / `.cuh`, `docs/VERSION.md`, `python/cuda_cccl/cuda/cccl/_version.py`, `.github/workflows/release-*.yml` | `references/release-cut.md`    |
+| Add a project     | `ci/matrix.yaml` (workflow rows, `jobs:` section), `ci/project_files_and_dependencies.yaml` (new project key + dependency chain), `CMakePresets.json` (new preset), `ci/build_<proj>.sh` + `ci/test_<proj>.sh` | `references/project-add.md`    |
+| Remove a project  | Inverse of add: remove matrix rows, yaml entry, presets, build/test scripts, update dependents                                                                    | `references/project-add.md`    |
+
+## Key matrix.yaml facts
+
+`ctk_versions` maps real CTK versions to aliases. `12.X` and `13.X` are aliases for the newest
+release in each major line; move the alias to the new entry when adding a newer release.
+`devcontainer_version` at the top of `matrix.yaml` pins the rapidsai/devcontainers image tag;
+bump it alongside new CTK/compiler additions.
+
+After any edit to `ctk_versions` or `host_compilers`, regenerate devcontainers:
+
+```
+cd .devcontainer
+bash make_devcontainers.sh --clean
+```
+
+## Hard prohibitions
+
+- Never merge a PR while `workflows.override` is non-empty in `ci/matrix.yaml`.
+- Never merge a PR with `[skip-*]` tags in the last commit message.
+- Never edit individual `.devcontainer/<name>/devcontainer.json` files by hand — always regenerate via `make_devcontainers.sh`.
+- Never bump `cccl-version.json` by hand — use `ci/update_version.sh` or the `update-branch-version.yml` workflow.
+
+## Additional resources
+
+- `references/ctk-bump.md` — ordered checklist for adding a new CTK version
+- `references/compiler-bump.md` — ordered checklist for adding a new host compiler
+- `references/release-cut.md` — release cycle steps (branch cut through finalization)
+- `references/project-add.md` — adding or removing a CCCL project
+- `references/docs.md` — index of maintainer/infra documentation.
+- `references/tools.md` — infra scripts (version management, utilities, downstream testing).
diff --git a/.agent/skills/cccl-infra/references/compiler-bump.md b/.agent/skills/cccl-infra/references/compiler-bump.md
new file mode 100644
index 00000000000..f9af4d9c7de
--- /dev/null
+++ b/.agent/skills/cccl-infra/references/compiler-bump.md
@@ -0,0 +1,64 @@
+# Compiler bump checklist
+
+Adding a new host compiler version (e.g., GCC 16, Clang 22) to the CCCL matrix.
+
+## Step 1 — Update `ci/matrix.yaml`
+
+1. Add the version entry under `host_compilers.<family>.versions`:
+   ```yaml
+   host_compilers:
+     gcc:
+       versions:
+         16: { stds: [17, 20] }
+   ```
+   Set `stds` to the C++ standards the compiler supports. For MSVC, the version key is the
+   toolset version (e.g., `'14.50'`); add a human-friendly alias (e.g., `alias: '2026'`).
+
+2. If this is the new "latest" of its family, workflow rows that use the bare family name
+   (`cxx: 'gcc'` / `cxx: 'clang'` / `cxx: 'msvc'`) pick it up automatically through alias
+   resolution, so those rows need no edit. Still check whether `cuda99_gcc_version` or
+   `cuda99_clang_version` should be bumped to the new version (see item 4).
+
+3. Update workflow rows in `pull_request`, `nightly`, and `weekly` to cover the new compiler:
+   - Add to oldest/newest bracketing rows at PR level.
+   - Add to full-matrix rows at nightly/weekly.
+   - For compilers that only work with certain CTK versions, include explicit `ctk:` constraints.
+
+4. If the new compiler is the default for the `cuda99.X` internal containers, update
+   `cuda99_gcc_version` or `cuda99_clang_version` at the top of `matrix.yaml`.
+
+5. Add the compiler to `devcontainer_verify` rows if applicable (paired with supported CTK
+   versions). Pattern: `{jobs: ['dc'], ctk: [...], cxx: ['<family><version>']}`.
+
+## Step 2 — Regenerate devcontainers
+
+```bash
+cd .devcontainer
+bash make_devcontainers.sh --clean
+```
+
+New directories appear for each `cuda<ctk>-<compiler><version>` combination. Verify that
+the expected image tag exists in rapidsai/devcontainers before merging.
+
+## Step 3 — Validate build support
+
+Run a targeted build using `ci/util/build_and_test_targets.sh` or the override matrix to
+confirm the new compiler actually works with the selected CTK versions before enabling it
+in the full nightly matrix.
+
+## Step 4 — Remove dropped compilers (if applicable)
+
+When dropping an old compiler version:
+1. Remove its entry from `host_compilers.<family>.versions`.
+2. Remove all workflow rows that explicitly reference it (`cxx: ['gcc7']` etc.).
+3. Run `make_devcontainers.sh --clean` — old subdirectories are removed.
+4. Update `CONTRIBUTING.md` if the supported compiler range changed.
+
+## Files touched summary
+
+| File                                | Change                                                   |
+|-------------------------------------|----------------------------------------------------------|
+| `ci/matrix.yaml`                    | `host_compilers`, `cuda99_*_version`, `devcontainer_verify` rows, `pull_request` / `nightly` / `weekly` rows |
+| `.devcontainer/<name>/devcontainer.json` | Auto-generated by `make_devcontainers.sh`                |
+| `.devcontainer/devcontainer.json`   | Auto-updated (default container, if new compiler becomes default) |
+| `CONTRIBUTING.md`                   | Supported compiler range (if changed)                    |
diff --git a/.agent/skills/cccl-infra/references/ctk-bump.md b/.agent/skills/cccl-infra/references/ctk-bump.md
new file mode 100644
index 00000000000..c7159110708
--- /dev/null
+++ b/.agent/skills/cccl-infra/references/ctk-bump.md
@@ -0,0 +1,65 @@
+# CTK bump checklist
+
+Adding a new CUDA Toolkit version (e.g., 13.3) to the CCCL matrix.
+
+## Step 1 — Update `ci/matrix.yaml`
+
+1. Add the new version to `ctk_versions`:
+   ```yaml
+   ctk_versions:
+     13.3: { stds: [17, 20], alias: ['13.X'] }
+   ```
+   If this is the new "latest" release of a major line, move the `13.X` alias from the old
+   entry to the new one. The old entry keeps its explicit version string and loses the alias.
+
+2. If the new CTK ships in a new NVHPC SDK, update the `nvhpc` / `nvhpc-prev` aliases in
+   `ctk_versions` and the corresponding version entries in `host_compilers.nvhpc.versions`.
+   The comment `!! Update the ctk_versions 'nvhpc*' aliases` marks the spot.
+
+3. Update `devcontainer_version` if rapidsai/devcontainers published new images for this CTK.
+
+4. Update `cuda99_gcc_version` / `cuda99_clang_version` if you want the internal cuda99.X
+   containers to track newer compilers.
+
+5. Add workflow rows for the new CTK to `pull_request`, `nightly`, and `weekly` sections.
+   Follow the existing pattern: oldest/newest host compilers at PR level, full matrix at
+   nightly/weekly. For a new major CTK, also add rows to `devcontainer_verify`.
+
+6. If the new CTK drops support for an older host compiler (or vice versa), remove those
+   combinations from any workflow rows that reference them.
+
+## Step 2 — Regenerate devcontainers
+
+```bash
+cd .devcontainer
+bash make_devcontainers.sh --clean
+```
+
+Inspect the diff. New subdirectories should appear for each `cuda<version>-<compiler>` combination.
+Stale subdirectories for removed combinations are deleted by `--clean`.
+
+## Step 3 — Verify devcontainer images exist
+
+Check that `rapidsai/devcontainers:<devcontainer_version>-cpp-<compiler>-cuda<version>` actually
+exists on Docker Hub before merging. The `verify-devcontainers.yml` workflow does this in CI, but
+a quick manual check prevents a broken merge.
+
+## Step 4 — Test the matrix
+
+Use the `workflows.override` section of `ci/matrix.yaml` to run a targeted subset of the new CTK
+jobs on your PR before enabling the full matrix. See `ci-overview.md` for override syntax.
+Remove the override before merging.
+
+## Step 5 — Update CONTRIBUTING.md / docs
+
+If the supported CTK range changed (new minimum or maximum), update `CONTRIBUTING.md` and any
+docs that enumerate supported CTK versions.
+
+## Files touched summary
+
+| File                                | Change                                                   |
+|-------------------------------------|----------------------------------------------------------|
+| `ci/matrix.yaml`                    | `ctk_versions`, `devcontainer_version`, `cuda99_*_version`, workflow rows |
+| `.devcontainer/<name>/devcontainer.json` | Auto-generated by `make_devcontainers.sh`                |
+| `.devcontainer/devcontainer.json`   | Auto-updated (default container)                         |
+| `CONTRIBUTING.md`                   | Supported CTK range (if changed)                         |
diff --git a/.agent/skills/cccl-infra/references/docs.md b/.agent/skills/cccl-infra/references/docs.md
new file mode 100644
index 00000000000..379a71af4a3
--- /dev/null
+++ b/.agent/skills/cccl-infra/references/docs.md
@@ -0,0 +1,21 @@
+# Documentation index — cccl-infra
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `docs/maintainers/index.rst` | Maintainer policies and procedures landing page. |
+| `docs/maintainers/how_tos/index.rst` | Practical maintainer guides: release, backport, compiler bump, and other maintenance tasks. |
+| `docs/maintainers/references/index.rst` | Reference material: branching strategy, CodeRabbit config, backport process. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/maintainers/branching_strategy.rst` | Main, release, and development branch conventions. |
+| `docs/maintainers/backport_process.rst` | Cherry-picking bugfixes to release branches. |
+
+## See also
+
+- `cccl_detail-release` `references/docs.md` for versioning and release workflow documentation.
+- `cccl_detail-github` `references/docs.md` for GitHub-side infrastructure documentation.
diff --git a/.agent/skills/cccl-infra/references/project-add.md b/.agent/skills/cccl-infra/references/project-add.md
new file mode 100644
index 00000000000..bfe5029baa9
--- /dev/null
+++ b/.agent/skills/cccl-infra/references/project-add.md
@@ -0,0 +1,89 @@
+# Adding or removing a CCCL project
+
+## Adding a project
+
+### 1. Register in `ci/project_files_and_dependencies.yaml`
+
+Add one or more project keys. Common pattern: a `_public` key for the public API files
+and an `_internal` key for tests/infra that sets `matrix_project` (the name used in
+`ci/matrix.yaml` workflow rows).
+
+```yaml
+myproject_public:
+  name: "MyProject Public API"
+  lite_dependencies: [libcudacxx_public]
+  full_dependencies: []
+  include_regexes: ["myproject/include/"]
+
+myproject_internal:
+  name: "MyProject Tests/Infra"
+  matrix_project: "myproject"
+  lite_dependencies: []
+  full_dependencies: [myproject_public]
+  include_regexes: ["myproject/"]
+  exclude_project_files: [myproject_public]
+```
+
+Projects without `matrix_project` are internal-only; they affect dependency tracking but
+do not appear in `FULL_BUILD` / `LITE_BUILD` outputs. If other projects depend on the new
+project, add its public key to those projects' `lite_dependencies` or
+`full_dependencies`.
+
+For the `core` project: any dirty files not matched by any project trigger a full rebuild.
+Infra files (CMake, ci/, AGENTS.md, etc.) fall here by default.
+
+### 2. Add build and test scripts
+
+Create `ci/build_<matrix_project>.sh` and `ci/test_<matrix_project>.sh` following the
+existing patterns (e.g., `ci/build_cub.sh`, `ci/test_cub.sh`). Windows variants go under
+`ci/windows/build_<matrix_project>.ps1` / `ci/windows/test_<matrix_project>.ps1` if needed.
+
+### 3. Add a CMake preset
+
+Add a preset in `CMakePresets.json` for the new project. Inherit from an appropriate base
+preset. See `cccl-cmake` for preset conventions.
+
+### 4. Add workflow rows in `ci/matrix.yaml`
+
+Add rows to `pull_request`, `nightly`, and `weekly` sections referencing the `matrix_project`
+name. Mirror the structure of a similar existing project.
+
+Example:
+```yaml
+- {jobs: ['build'], project: 'myproject', std: 'minmax', cxx: ['gcc', 'clang', 'msvc']}
+- {jobs: ['test'],  project: 'myproject', std: 'max',    cxx: ['gcc', 'clang'], gpu: 'rtx2080'}
+```
+
+### 5. Add to `tidy` dependencies (optional)
+
+If the project has C++ headers that should be checked by clang-tidy, add both public and
+internal keys to `tidy.full_dependencies` in `ci/project_files_and_dependencies.yaml`.
+
+### 6. Update `ignore_regexes` if needed
+
+If any files in the new project directory should not trigger CI (e.g., pure scripts,
+benchmarks), add matching regexes to `ignore_regexes` at the bottom of
+`ci/project_files_and_dependencies.yaml`.
+
+## Removing a project
+
+1. Remove all workflow rows referencing the project from `ci/matrix.yaml`.
+2. Remove the project's keys from `ci/project_files_and_dependencies.yaml`.
+3. Remove the project from `tidy.full_dependencies` if present.
+4. Delete `ci/build_<project>.sh`, `ci/test_<project>.sh`, and Windows variants.
+5. Remove the CMake preset from `CMakePresets.json`.
+6. Remove the project directory and any top-level CMakeLists references.
+7. Update `CONTRIBUTING.md` and any docs that list the project.
+
+## Files touched summary
+
+| File                            | Change                              |
+|---------------------------------|-------------------------------------|
+| `ci/project_files_and_dependencies.yaml` | New project keys, dependency chains |
+| `ci/matrix.yaml`                | Workflow rows for new project       |
+| `CMakePresets.json`             | New preset                          |
+| `ci/build_<project>.sh`         | New build script                    |
+| `ci/test_<project>.sh`          | New test script                     |
+| `ci/windows/build_<project>.ps1` | New Windows build script (if needed) |
+| `ci/windows/test_<project>.ps1` | New Windows test script (if needed)  |
+| `CONTRIBUTING.md`               | Updated project list                |
diff --git a/.agent/skills/cccl-infra/references/release-cut.md b/.agent/skills/cccl-infra/references/release-cut.md
new file mode 100644
index 00000000000..1605af1e906
--- /dev/null
+++ b/.agent/skills/cccl-infra/references/release-cut.md
@@ -0,0 +1,61 @@
+# Release cut checklist
+
+The GitHub Actions workflows below orchestrate CCCL releases, one per phase. Run them in order.
+
+## Phase 0 — Version bump on main (pre-release)
+
+Trigger `.github/workflows/update-branch-version.yml` (workflow: "Release: 0. Update version
+in target branch") against `main` with the *next* version (e.g., `3.5.0`). This runs
+`ci/update_version.sh` and opens a PR.
+
+Files updated by `ci/update_version.sh`:
+| File                                       | Field           |
+|--------------------------------------------|-----------------|
+| `cccl-version.json`                        | `full`, `major`, `minor`, `patch` |
+| `libcudacxx/include/cuda/std/__cccl/version.h` | `CCCL_VERSION`  |
+| `thrust/thrust/version.h`                  | `THRUST_VERSION` |
+| `cub/cub/version.cuh`                      | `CUB_VERSION`   |
+| `lib/cmake/cccl/cccl-config-version.cmake` | version vars    |
+| `lib/cmake/cub/cub-config-version.cmake`   | version vars    |
+| `lib/cmake/libcudacxx/libcudacxx-config-version.cmake` | version vars |
+| `lib/cmake/thrust/thrust-config-version.cmake` | version vars |
+| `lib/cmake/cudax/cudax-config-version.cmake` | version vars |
+| `python/cuda_cccl/cuda/cccl/_version.py`   | `__version__`   |
+| `docs/VERSION.md`                          | major.minor     |
+
+Merge the PR from this workflow before proceeding.
+
+## Phase 1 — Begin release cycle
+
+Trigger `.github/workflows/release-create-new.yml` (workflow: "Release: 1. Begin Release
+Cycle") from the commit on `main` that should be the release base. Provide `main_version`
+(the next version for `main` after this release, e.g., `3.6.0`).
+
+This workflow:
+1. Creates `branch/3.5.x` from the selected ref.
+2. Updates `main` to `main_version`.
+3. Opens PRs for both operations.
+
+## Phase 2 — RC updates (optional)
+
+For release candidates, trigger `.github/workflows/release-update-rc.yml` on the release
+branch. Provide the RC version string (e.g., `3.5.0rc1`). Opens a PR against the release
+branch.
+
+## Phase 3 — Finalize
+
+Trigger `.github/workflows/release-finalize.yml` on the release branch. This workflow
+publishes the final tag and release artifacts.
+
+## Phase 4 — Wheel builds
+
+Trigger `.github/workflows/release-wheels.yml` after finalization to build and publish
+Python wheel artifacts.
+
+## Key invariants
+
+- Always verify `cccl-version.json` is correct before triggering Phase 1 — the workflow
+  reads it as the current version.
+- The `update-branch-version.yml` workflow is the only safe way to bump version numbers;
+  never edit version files by hand in the 10+ locations where they appear.
+- Release branches follow the pattern `branch/{major}.{minor}.x`.
diff --git a/.agent/skills/cccl-infra/references/tools.md b/.agent/skills/cccl-infra/references/tools.md
new file mode 100644
index 00000000000..49cfd92c797
--- /dev/null
+++ b/.agent/skills/cccl-infra/references/tools.md
@@ -0,0 +1,28 @@
+# Tool index — cccl-infra
+
+## Owned (canonical reference lives here)
+
+| Tool | Purpose |
+|------|---------|
+| `ci/update_version.sh` | Updates `cccl-version.json` and version header files to a new version string. Rejects downgrades. |
+| `ci/generate_version.sh` | Regenerates `version.h` and `version.cuh` header files from `cccl-version.json`. |
+| `ci/update_rapids_version.sh` | Updates RAPIDS-specific version constraints in `ci/rapids/`. |
+| `ci/util/memmon.sh` | Memory usage monitor: polls RSS + swap at configurable intervals and logs peaks. Used in `build_common.sh` for CI build monitoring. |
+| `ci/util/retry.sh` | Retries a command N times with configurable backoff. Used in CI for flaky network operations. |
+| `ci/util/version_compare.sh` | Compares two semantic version strings (`X.Y.Z`). Used in CI scripts for version guards. |
+| `ci/util/extract_switches.sh` | Extracts boolean flag arguments from a command line (e.g. `-lid0`, `-no-lid`). Sourced by per-project build scripts. |
+| `ci/util/manifest.sh` | Creates and validates build artifact manifests for CI artifact tracking. |
+| `ci/util/create_mock_job_env.sh` | Creates a mock GHA job environment for local testing of CI scripts. |
+| `ci/pyenv_helper.sh` | Manages Python virtual environments for CCCL CI (installs packages, activates venvs). |
+| `ci/verify_codegen_libcudacxx.sh` | Verifies libcudacxx code generation output for all supported architectures. |
+| `ci/install_packaging.sh` | Installs CPM and packaging dependencies for downstream consumption tests. |
+| `ci/install_cccl.sh` | Installs CCCL headers and Python wheels to a target system path. |
+| `ci/nvrtc_libcudacxx.sh` | Verifies libcudacxx compiles via NVRTC (runtime compilation path). |
+
+## Notes
+
+- `ci/util/memmon.sh` is the only user-visible monitoring utility; it can be enabled locally
+  by setting `MEMMON=1` in the environment when invoking a build script.
+- `ci/util/retry.sh` and `ci/util/extract_switches.sh` are sourced libraries, not standalone tools.
+- Version scripts (`update_version.sh`, `generate_version.sh`) should be invoked via the
+  `update-branch-version.yml` workflow rather than directly — the workflow handles branching and PR creation.
diff --git a/.agent/skills/cccl-libcudacxx-style/SKILL.md b/.agent/skills/cccl-libcudacxx-style/SKILL.md
deleted file mode 100644
index 48eec852094..00000000000
--- a/.agent/skills/cccl-libcudacxx-style/SKILL.md
+++ /dev/null
@@ -1,94 +0,0 @@
----
-name: cccl-libcudacxx-style
-description: Make the code in libcudacxx/include, cudax/include compliant with the coding style
----
-
-# libcudacxx Style
-
-## Naming style
-
-- Macros: macro style, e.g. `MY_MACRO`.
-- Template parameters: CamelCase, e.g. `MyParameter`.
-- All other symbols: snake style, e.g. `my_variable`.
-
-All non-public symbols must be C++ reserved identifiers:
-
-- `_` for macros and template parameters, e.g. `_MY_MACRO`., `_MyParameter`.
-- `__` for all other symbols, e.g. `__my_variable`.
-
-- Avoid single letter names for template parameters. Wrong: `_T`, correct: `_Tp`.
-
-## Variables
-
-- All variables that are not modified must use `const`. This includes variables initialized by casts (`static_cast`, `reinterpret_cast`, `bit_cast`), function return values, and loop-invariant computations.
-- All variables that can be evaluated at compile-time must use `constexpr`.
-- All `constexpr` variables at namespace/global scope must use `inline`, including `template` variables.
-- Consider using plural names for array, span, list, e.g. `int values[4]` instead of `int value[4]`.
-
-## Function
-
-Declaration/Definition:
-
-- All functions must be marked `_CCCL_HOST_API`, `_CCCL_DEVICE_API`, or `_CCCL_API`.
-- Non-template, non-`constexpr` functions must use `inline`.
-- Most functions with a non-void return type shall use `[[nodiscard]]`. Exceptions are functions with known side effects, e.g. `cuda::std::copy`
-- All functions that don't throw exception must use `noexcept`
-- `constexpr` must be used for all functions that don't depend on run-time features, e.g. pointers.
-- If the return type is not explicit (`auto`), then a trailing return type is strongly preferred, e.g. `auto abs(float) -> float`
-
-Function call:
-
-- All calls to free functions must be fully qualified starting from the global namespace, e.g. `::cuda::ceil_div`. This includes calls to functions defined in the same namespace, e.g. inside `cuda::`, call `::cuda::ceil_div(...)`, not `ceil_div(...)`. This does not apply to (static) member functions of classes.
-
-## Types
-
-- Type names must be fully qualified, except when they are already declared in the current namespace.
-- This includes standard integer type aliases (`::cuda::std::size_t`, `::cuda::std::uintptr_t`, `::cuda::std::int32_t`, etc.) and any other `cuda::std` or standard library types. A local `using` declaration (e.g. `using ::cuda::std::size_t;`) is acceptable to avoid repetition within a function body.
-
-## Headers
-
-- All header inclusions must use the syntax `<header>`.
-- Files must include all headers related to the symbols that they are using.
-- No transitive header inclusion are allowed.
-- Unneeded headers must be removed.
-- The headers must be the most precise one, e.g. `#include <cuda/std/__type_traits/is_array.h>`.
-- Headers in `cuda/std/__cccl/` must not be included directly (they are provided by `__config` or the prologue/epilogue mechanism).
-
-- All headers must have the correct license.
-
-- `libcudacxx/include/cuda/std` files: If the file is ported from LLVM libc++ then we *must* use the LLVM license.
-- `libcudacxx/include/cuda/` files: use the Apache License v2.0 with LLVM Exceptions.
-- All headers must have the include guard, with the correct name: uppercase full path from the root, separated by `_`.
-- The closing `#endif` always carries a comment repeating the guard name.
-- Right after the include guard, the code must include:
-```cpp
-#include <cuda/std/detail/__config>
-
-#if defined(_CCCL_IMPLICIT_SYSTEM_HEADER_GCC)
-#  pragma GCC system_header
-#elif defined(_CCCL_IMPLICIT_SYSTEM_HEADER_CLANG)
-#  pragma clang system_header
-#elif defined(_CCCL_IMPLICIT_SYSTEM_HEADER_MSVC)
-#  pragma system_header
-#endif // no system header
-```
-- The last included header must be `#include <cuda/std/__cccl/prologue.h>` before the code, and `#include <cuda/std/__cccl/epilogue.h>` at the end of a file.
-
-## Comments
-
-- Commented code without a description is not allowed.
-- Use Doxygen-style `//! @brief comments`.
-- When a function is documented with Doxygen, it must include: `//! @brief`, `//! @param[in/out/in,out]` for every parameter, and `//! @return` for non-void functions.
-- The `@brief/@param/@return` description must accurately reflect the current functionality of the function.
-
-## General guidelines
-
-- The code must reuse `cuda/` or `cuda/std` functionalities as much as possible, including macros.
-- Try to use modern C++ as much as possible. The repository supports C++17 but many more recent functionalities have been backported with functions and macros.
-
-## Prevent compiler errors and improve compatibility
-
-- Never allow lambda expressions in device-only or host-device code.
-- Protect host-only code with `#if !_CCCL_COMPILER(NVRTC)`.
-- Remove unused code, variables, functions, types, template parameters, headers, etc.
-- Variables that are unsigned, or that can become unsigned after template instantiation, must not check for negative values directly. Use `cuda::std::is_unsigned_v<T> ? false : (var < 0)` instead.
diff --git a/.agent/skills/cccl-libcudacxx/SKILL.md b/.agent/skills/cccl-libcudacxx/SKILL.md
new file mode 100644
index 00000000000..39fec400251
--- /dev/null
+++ b/.agent/skills/cccl-libcudacxx/SKILL.md
@@ -0,0 +1,74 @@
+---
+description: |
+  Tour and orientation for the libcudacxx subdirectory — what the library is, how the
+  include tree is laid out, the LLVM upstream-tracking model, CCCL-specific additions
+  under `<cuda/std/...>`, test suite structure, and how style is enforced.
+  Triggers: "what is libcudacxx", "how does libcudacxx work", "libcudacxx overview",
+  "libcudacxx code style", "make this libcudacxx code compliant".
+---
+
+# libcudacxx
+
+libcudacxx is CCCL's C++ standard library for both host and device code. It provides
+`<cuda/std/...>` headers that mirror the C++ standard library (`<cuda/std/atomic>`,
+`<cuda/std/tuple>`, etc.) and CCCL-specific extensions under `<cuda/...>`.
+
+## Directory layout
+
+| Path                            | Contents                                        |
+|---------------------------------|--------------------------------------------------|
+| `libcudacxx/include/cuda/std/`  | Headers ported or tracking LLVM libc++          |
+| `libcudacxx/include/cuda/std/__cccl/` | CCCL config, prologue/epilogue machinery      |
+| `libcudacxx/include/cuda/`      | CCCL-only extensions (not tracked upstream)     |
+| `libcudacxx/test/`              | Lit test suite + unit tests                     |
+| `libcudacxx/cmake/`             | CMake helpers used by the library's build       |
+
+## Upstream-tracking model
+
+`libcudacxx/include/cuda/std/` tracks LLVM libc++. Files ported from LLVM carry the
+LLVM license. Files under `libcudacxx/include/cuda/` (CCCL-only) use the Apache License
+v2.0 with LLVM Exceptions. License choice follows directory, not content.
+
+When syncing from upstream, preserve LLVM naming and structure inside `cuda/std/`; apply
+CCCL macros and visibility annotations on top without restructuring.
+
+## CCCL-specific include subtree
+
+`cuda/std/__cccl/` is the configuration layer. It is not included directly — `__config`
+or the prologue/epilogue mechanism provides it. Every header must include:
+
+1. `<cuda/std/detail/__config>` immediately after the include guard.
+2. System-header pragmas (`_CCCL_IMPLICIT_SYSTEM_HEADER_*` guards).
+3. `<cuda/std/__cccl/prologue.h>` as the last include before code.
+4. `<cuda/std/__cccl/epilogue.h>` at the end of the file.
+
+## Test suite
+
+Tests live under `libcudacxx/test/`. Two categories:
+
+- **Lit tests** — structured as `.pass.cpp` / `.fail.cpp` / `.verify.cpp` files;
+  discovered and run by the lit runner. Naming and organization rules in
+  `references/style/testing.md`.
+- **Unit tests** — conventional CMake/CTest targets; heavier integration scenarios.
+
+Run lit tests via `cccl-test` (the underlying `build_and_test_targets.sh` script takes a `--lit-tests` flag).
+
+## Style enforcement
+
+Style is applied to `libcudacxx/include/` and `cudax/include/`. Pre-commit runs
+clang-format and a set of custom checks. CI enforces the same set.
+
+Style is split across focused references below. When making a file compliant, work
+through naming → macros → templates → headers → visibility → comments in that order,
+then verify with `pre-commit run --files <files>`.
+
+## Additional resources
+
+- `references/style/naming.md` — naming conventions: types, functions, macros, files
+- `references/style/macros.md` — `_CCCL_*` macro rules; API, host/device, nodiscard
+- `references/style/templates.md` — template parameters, concepts, SFINAE, `constexpr`
+- `references/style/headers.md` — include order, guard format, license selection
+- `references/style/visibility.md` — `_CCCL_HIDE_FROM_ABI`, inlining, `noexcept` rules
+- `references/style/testing.md` — lit test layout, naming conventions, unit test targets
+- `references/docs.md` — index of libcudacxx documentation (standard/extended/PTX API).
+- `references/tools.md` — build and test scripts for libcudacxx.
diff --git a/.agent/skills/cccl-libcudacxx/references/docs.md b/.agent/skills/cccl-libcudacxx/references/docs.md
new file mode 100644
index 00000000000..a559375dfd8
--- /dev/null
+++ b/.agent/skills/cccl-libcudacxx/references/docs.md
@@ -0,0 +1,26 @@
+# Documentation index — cccl-libcudacxx
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `docs/libcudacxx/index.rst` | libcudacxx overview: C++ Standard Library in CUDA, CUDA-specific abstractions. |
+| `docs/libcudacxx/setup.rst` | Installation, header organization, C++ standard selection (`-std=c++17/20`). |
+| `docs/libcudacxx/standard_api.rst` | C++ Standard Library features available in device code (containers, algorithms, etc.). |
+| `docs/libcudacxx/standard_api/index.rst` | Auto-generated Standard Library API documentation (and subdirectories). |
+| `docs/libcudacxx/extended_api.rst` | CUDA extensions beyond C++ Standard Library: atomics, barriers, synchronization. |
+| `docs/libcudacxx/extended_api/index.rst` | Auto-generated CUDA-specific API documentation (and subdirectories). |
+| `docs/libcudacxx/runtime.rst` | CUDA runtime interaction: streams, events, synchronization primitives. |
+| `docs/libcudacxx/api/index.rst` | Full API reference index and category organization (and subdirectories). |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/libcudacxx/ptx_api.rst` | Low-level PTX instruction access and inline assembly. |
+| `docs/libcudacxx/ptx/index.rst` | Auto-generated PTX intrinsic documentation (and subdirectories). |
+| `docs/libcudacxx/tile.rst` | Tile data layout and hierarchical algorithm abstractions. |
+
+## See also
+
+- `cccl_detail-cpp-macros` for visibility, ABI, and diagnostic macro internals.
diff --git a/.agent/skills/cccl-libcudacxx/references/style/headers.md b/.agent/skills/cccl-libcudacxx/references/style/headers.md
new file mode 100644
index 00000000000..d7d7040816a
--- /dev/null
+++ b/.agent/skills/cccl-libcudacxx/references/style/headers.md
@@ -0,0 +1,84 @@
+# Header conventions
+
+## Include syntax
+
+All inclusions must use angle-bracket form: `#include <header>`. No quoted includes.
+
+## Self-sufficiency
+
+Each file must include every header for every symbol it uses. Transitive inclusion is not
+allowed — do not rely on a symbol being pulled in by another header you include.
+Unneeded headers must be removed.
+
+## Precision
+
+Use the most precise header available. Prefer the internal single-symbol header over the
+umbrella:
+
+```cpp
+#include <cuda/std/__type_traits/is_array.h>   // correct
+#include <cuda/std/type_traits>                  // too broad
+```
+
+## Headers in `cuda/std/__cccl/`
+
+Do not include headers from `cuda/std/__cccl/` directly. They are provided by
+`<cuda/std/detail/__config>` or the prologue/epilogue mechanism.
+
+## Required boilerplate — order
+
+Every header must follow this structure in order:
+
+```cpp
+// 1. License block
+
+// 2. Include guard
+#ifndef _CUDA_STD_<UPPER_FULL_PATH>
+#define _CUDA_STD_<UPPER_FULL_PATH>
+
+// 3. Config header (immediately after guard)
+#include <cuda/std/detail/__config>
+
+// 4. System-header pragmas
+#if defined(_CCCL_IMPLICIT_SYSTEM_HEADER_GCC)
+#  pragma GCC system_header
+#elif defined(_CCCL_IMPLICIT_SYSTEM_HEADER_CLANG)
+#  pragma clang system_header
+#elif defined(_CCCL_IMPLICIT_SYSTEM_HEADER_MSVC)
+#  pragma system_header
+#endif // no system header
+
+// 5. Other includes
+
+// 6. Prologue (last include before code)
+#include <cuda/std/__cccl/prologue.h>
+
+// ... file content ...
+
+// 7. Epilogue (end of file)
+#include <cuda/std/__cccl/epilogue.h>
+
+#endif // _CUDA_STD_<UPPER_FULL_PATH>
+```
+
+## Include guard naming
+
+The guard name is the header's path relative to the include root (`libcudacxx/include/`),
+uppercased and prefixed with `_`, with `/` and `.` replaced by `_`:
+
+- `libcudacxx/include/cuda/std/atomic` → `_CUDA_STD_ATOMIC`
+- `libcudacxx/include/cuda/atomic` → `_CUDA_ATOMIC`
+
+The closing `#endif` always carries a comment repeating the guard name:
+```cpp
+#endif // _CUDA_STD_ATOMIC
+```
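+
+The naming rule can be sketched as a small string transformation. This is only an
+illustration: `guard_name` and `check_documented_guards` are hypothetical helpers, not part
+of CCCL, and the sketch assumes the input path is already relative to the include root.
+
+```cpp
+#include <cassert>
+#include <cctype>
+#include <string>
+
+// Hypothetical helper (not part of CCCL): derive an include-guard name from a
+// header path given relative to the include root, e.g. "cuda/std/atomic".
+inline std::string guard_name(const std::string& path_from_include_root)
+{
+  std::string guard = "_"; // leading underscore, as in the documented guard names
+  for (const char c : path_from_include_root)
+  {
+    if (c == '/' || c == '.')
+    {
+      guard += '_'; // path separators and dots both become underscores
+    }
+    else
+    {
+      guard += static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
+    }
+  }
+  return guard;
+}
+
+// Checks the two guard names documented above.
+inline void check_documented_guards()
+{
+  assert(guard_name("cuda/std/atomic") == "_CUDA_STD_ATOMIC");
+  assert(guard_name("cuda/atomic") == "_CUDA_ATOMIC");
+}
+```
+
+Calling `check_documented_guards()` exercises both documented examples; a helper like this
+could back a lint check, but that use is a suggestion rather than an existing tool.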
+
+## License selection
+
+| Directory                                                | License                               |
+|------------------------------------------------------------|---------------------------------------|
+| `libcudacxx/include/cuda/std/` (ported from LLVM libc++) | LLVM license                          |
+| `libcudacxx/include/cuda/` (CCCL-only extensions)        | Apache License v2.0 with LLVM Exceptions |
+
+License follows directory location, not content origin.
diff --git a/.agent/skills/cccl-libcudacxx/references/style/macros.md b/.agent/skills/cccl-libcudacxx/references/style/macros.md
new file mode 100644
index 00000000000..1bd1cba1de8
--- /dev/null
+++ b/.agent/skills/cccl-libcudacxx/references/style/macros.md
@@ -0,0 +1,69 @@
+# Macro rules
+
+## API linkage macros
+
+Every function must carry exactly one of:
+
+| Macro                 | Use                  |
+|-----------------------|----------------------|
+| `_CCCL_HOST_API`      | Host-only function   |
+| `_CCCL_DEVICE_API`    | Device-only function |
+| `_CCCL_API`           | Host-device function |
+
+## `inline` requirement
+
+Non-template, non-`constexpr` functions must use `inline`.
+
+## `[[nodiscard]]`
+
+Most functions with a non-void return type must use `[[nodiscard]]`. Exceptions: functions
+with known side effects (e.g. `cuda::std::copy`).
+
+## `noexcept`
+
+All functions that do not throw must be marked `noexcept`.
+
+## `constexpr`
+
+Use `constexpr` for all functions that do not depend on run-time features (pointers,
+device memory, etc.). Variables that can be evaluated at compile time must also be
+`constexpr`.
+
+## `inline` on `constexpr` variables
+
+All `constexpr` variables at namespace or global scope must use `inline`, including
+template variables:
+
+```cpp
+inline constexpr int foo = 42;
+template <typename T>
+inline constexpr bool is_foo_v = ...;
+```
+
+## `const` on non-modified variables
+
+All variables that are not modified must use `const`, including:
+- Variables initialized by casts (`static_cast`, `reinterpret_cast`, `bit_cast`)
+- Function return values captured in a local
+- Loop-invariant computations
+
+## Compiler-compatibility macros
+
+- Never allow lambda expressions in device-only or host-device code.
+- Protect host-only code with `#if !_CCCL_COMPILER(NVRTC)`.
+- Variables that are unsigned (or can become unsigned after template instantiation) must
+  not check for negative values directly:
+
+```cpp
+// Wrong
+if (var < 0) { ... }
+
+// Correct
+if (::cuda::std::is_unsigned_v<T> ? false : (var < 0)) { ... }
+```
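+
+A minimal sketch of the same pattern inside a template, where the type may or may not be
+signed. `is_negative` is a hypothetical helper, and `std::` is used as a stand-in for
+`::cuda::std::` so the example compiles without CCCL on the include path.
+
+```cpp
+#include <type_traits>
+
+// Hypothetical helper (not a CCCL function). `std::` stands in for `::cuda::std::`.
+template <class Tp>
+[[nodiscard]] constexpr bool is_negative(const Tp value) noexcept
+{
+  // For unsigned Tp the condition selects `false`, so the result never
+  // depends on the `value < 0` comparison.
+  return std::is_unsigned_v<Tp> ? false : (value < 0);
+}
+
+static_assert(!is_negative(5u)); // unsigned: always false
+static_assert(is_negative(-1));  // signed negative value
+static_assert(!is_negative(0));  // signed zero
+```
+
+The same function body works for signed and unsigned instantiations, which is the case the
+rule targets.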
+
+## General macro policy
+
+- Reuse `cuda/` or `cuda/std/` macros wherever they exist; do not roll bespoke macros for
+  things already covered by the CCCL or standard library infrastructure.
+- Remove unused macros, variables, functions, types, template parameters, and headers.
diff --git a/.agent/skills/cccl-libcudacxx/references/style/naming.md b/.agent/skills/cccl-libcudacxx/references/style/naming.md
new file mode 100644
index 00000000000..70c361500d5
--- /dev/null
+++ b/.agent/skills/cccl-libcudacxx/references/style/naming.md
@@ -0,0 +1,53 @@
+# Naming conventions
+
+## Symbol naming
+
+| Symbol kind           | Style               | Example        |
+|----------------------|---------------------|-----------------|
+| Macros                | `UPPER_SNAKE_CASE`  | `MY_MACRO`      |
+| Template parameters   | `CamelCase`         | `MyParameter`   |
+| All other symbols     | `lower_snake_case`  | `my_variable`   |
+
+## Non-public symbols — reserved identifier prefix
+
+Non-public symbols must be C++ reserved identifiers:
+
+- Macros and template parameters: single-underscore prefix — `_MY_MACRO`, `_MyParameter`.
+- All other symbols: double-underscore prefix — `__my_variable`.
+
+## Template parameter names
+
+Avoid single-letter template parameter names. Prefer short but descriptive names:
+
+- Wrong: `_T`, `_U`
+- Correct: `_Tp`, `_Up`, `_Key`, `_Value`
+
+## Type qualification
+
+Type names must be fully qualified except when already declared in the current namespace.
+This includes standard integer type aliases:
+
+```cpp
+::cuda::std::size_t
+::cuda::std::uintptr_t
+::cuda::std::int32_t
+```
+
+A local `using` declaration is acceptable to reduce repetition within a function body:
+
+```cpp
+using ::cuda::std::size_t;
+```
+
+## Free function calls
+
+All calls to free functions must be fully qualified from the global namespace — including
+calls to functions in the same namespace:
+
+```cpp
+// Inside namespace cuda::
+::cuda::ceil_div(a, b);   // correct
+ceil_div(a, b);            // wrong — unqualified
+```
+
+This rule does not apply to member functions of classes, whether static or non-static.
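+
+The source does not state why free-function calls must be fully qualified. One plausible
+motivation, assumed here, is that an unqualified call is subject to argument-dependent
+lookup and can bind to an unrelated function in the argument's namespace. The sketch below
+uses hypothetical namespaces `lib` and `user` to show the difference.
+
+```cpp
+namespace lib
+{
+// Generic fallback provided by the (hypothetical) library.
+template <class Tp>
+constexpr int describe(const Tp&) noexcept
+{
+  return 1;
+}
+
+// Unqualified call: argument-dependent lookup may select an overload from the
+// argument's namespace instead of lib::describe.
+template <class Tp>
+constexpr int call_unqualified(const Tp& value) noexcept
+{
+  return describe(value);
+}
+
+// Fully qualified call: always resolves to lib::describe.
+template <class Tp>
+constexpr int call_qualified(const Tp& value) noexcept
+{
+  return ::lib::describe(value);
+}
+} // namespace lib
+
+namespace user
+{
+struct widget
+{};
+
+// Unrelated user function that happens to share the name.
+constexpr int describe(const widget&) noexcept
+{
+  return 2;
+}
+} // namespace user
+
+static_assert(lib::call_unqualified(user::widget{}) == 2, "ADL picked user::describe");
+static_assert(lib::call_qualified(user::widget{}) == 1, "qualified call stays in lib");
+```
+
+The non-template `user::describe` wins overload resolution in the unqualified call, while
+the qualified call cannot be redirected this way.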
diff --git a/.agent/skills/cccl-libcudacxx/references/style/templates.md b/.agent/skills/cccl-libcudacxx/references/style/templates.md
new file mode 100644
index 00000000000..b059c4aa8dd
--- /dev/null
+++ b/.agent/skills/cccl-libcudacxx/references/style/templates.md
@@ -0,0 +1,46 @@
+# Template, concept, and `constexpr` conventions
+
+## Template parameters
+
+- Use `CamelCase` prefixed with `_` for all template parameters: `_Tp`, `_Up`, `_Key`,
+  `_Value`, `_Alloc`.
+- Avoid single-letter names. `_T` and `_U` are prohibited; `_Tp` and `_Up` are the
+  conventional replacements.
+
+## `constexpr` functions
+
+Use `constexpr` for all functions that do not depend on run-time features (raw pointers,
+device memory, system calls). The default should be `constexpr`; opt out only when a
+specific run-time dependency prevents it.
+
+## Trailing return types
+
+When the return type is not explicitly spelled out (`auto`), a trailing return type is
+strongly preferred:
+
+```cpp
+auto abs(float x) -> float;
+template <class _Tp, class _Up> auto make_pair(_Tp t, _Up u) -> ::cuda::std::pair<_Tp, _Up>;
+```
+
+This keeps the return type visible at the point of declaration instead of leaving it to be
+deduced from the function body.
+
+## SFINAE and `enable_if`
+
+Prefer `_CCCL_REQUIRES` or concept-based constraints when available. When falling back to
+`enable_if`, place it in the trailing return type or as a defaulted non-type template
+parameter — never in the function parameter list.
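+
+A minimal sketch of the `enable_if` fallback placed in a defaulted non-type template
+parameter. `is_even` and `can_call_is_even` are hypothetical names, and `std::` stands in
+for `::cuda::std::` so the example compiles without CCCL.
+
+```cpp
+#include <type_traits>
+#include <utility>
+
+// Hypothetical function; `std::` stands in for `::cuda::std::`.
+// The constraint lives in a defaulted non-type template parameter, so the
+// function parameter list stays free of SFINAE machinery.
+template <class Tp, std::enable_if_t<std::is_integral_v<Tp>, int> = 0>
+[[nodiscard]] constexpr bool is_even(const Tp value) noexcept
+{
+  return value % 2 == 0;
+}
+
+// Detection helper: true when `is_even(Tp)` is a viable call.
+template <class Tp, class = void>
+inline constexpr bool can_call_is_even = false;
+template <class Tp>
+inline constexpr bool can_call_is_even<Tp, std::void_t<decltype(is_even(std::declval<Tp>()))>> = true;
+
+static_assert(is_even(4));
+static_assert(!is_even(7));
+static_assert(can_call_is_even<int>);     // integral: overload participates
+static_assert(!can_call_is_even<double>); // non-integral: removed by SFINAE
+```
+
+The function signature reads `is_even(const Tp value)` with nothing else in the parameter
+list, which is the property the rule asks for.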
+
+## Modern C++
+
+The repository supports C++17 as the minimum, but many C++20 and later features have been
+backported via CCCL macros and `cuda/std` wrappers. Prefer the backported form over a
+hand-rolled equivalent when one is available:
+
+- Concepts → `_CCCL_REQUIRES` / `_CCCL_CONCEPT`
+- `consteval` replacements → `_CCCL_CONSTEVAL`
+- `[[nodiscard]]` → use directly (supported from C++17)
+
+Consult `libcudacxx/include/cuda/std/__cccl/` for the available portability macros before
+writing bespoke SFINAE.
diff --git a/.agent/skills/cccl-libcudacxx/references/style/testing.md b/.agent/skills/cccl-libcudacxx/references/style/testing.md
new file mode 100644
index 00000000000..4fabc78d536
--- /dev/null
+++ b/.agent/skills/cccl-libcudacxx/references/style/testing.md
@@ -0,0 +1,65 @@
+# Test suite organization
+
+## Two test categories
+
+| Category  | Location                       | Runner                       |
+|-----------|--------------------------------|------------------------------|
+| Lit tests | `libcudacxx/test/` (recursive) | `llvm-lit` / CCCL lit wrapper |
+| Unit tests | `libcudacxx/test/` (CMake targets) | CTest                        |
+
+Run lit tests via `cccl-test` (the underlying `build_and_test_targets.sh` script takes
+a `--lit-tests` flag). Unit tests run as normal CTest targets.
+
+## Lit test naming
+
+Lit tests are plain `.cpp` files with a suffix that encodes the expected outcome:
+
+| Suffix       | Meaning                                                    |
+|--------------|-------------------------------------------------------------|
+| `.pass.cpp`  | Must compile and run successfully                           |
+| `.fail.cpp`  | Must fail to compile                                        |
+| `.verify.cpp` | Compiler output is checked against `// expected-*` annotations |
+
+## Lit test layout
+
+Tests mirror the include tree. A test for `<cuda/std/atomic>` lives under
+`libcudacxx/test/std/atomics/`. CCCL-specific extensions under `<cuda/...>` have tests
+under `libcudacxx/test/cuda/`.
+
+## Lit test file structure
+
+Each lit test is self-contained:
+
+```cpp
+// SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. ...
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+// RUN: %{build}
+// RUN: %{run}
+
+#include <cuda/std/...>
+// test body
+```
+
+The `RUN:` lines invoke the lit substitution variables defined by the test suite
+configuration.
+
+## Doxygen in production headers
+
+Functions documented with Doxygen must include all three tags; partial documentation is
+not allowed:
+
+```cpp
+//! @brief One-line description of what the function does.
+//! @param[in] x Description of x.
+//! @param[out] result Description of result.
+//! @return What the return value means.
+```
+
+Omit the `//! @return` line only for `void` functions. The description must reflect
+current behavior — stale docs are treated as bugs.
+
+## Comments
+
+- Commented-out code without an explanatory comment is not allowed.
+- Use `//! @brief` for Doxygen; `//` for inline implementation notes.
diff --git a/.agent/skills/cccl-libcudacxx/references/style/visibility.md b/.agent/skills/cccl-libcudacxx/references/style/visibility.md
new file mode 100644
index 00000000000..3be39f34bc4
--- /dev/null
+++ b/.agent/skills/cccl-libcudacxx/references/style/visibility.md
@@ -0,0 +1,53 @@
+# Visibility, linkage, and ABI rules
+
+## API macros (required on every function)
+
+Every function in `libcudacxx/include/` must carry exactly one API macro:
+
+| Macro               | Meaning         |
+|---------------------|-----------------|
+| `_CCCL_HOST_API`    | Host-only       |
+| `_CCCL_DEVICE_API`  | Device-only     |
+| `_CCCL_API`         | Host + device   |
+
+These macros control symbol visibility and `__forceinline__` / `inline` expansion.
+
+## `inline` on non-template, non-`constexpr` functions
+
+Non-template, non-`constexpr` functions must be marked `inline` in addition to the API
+macro. Without `inline`, multiple-definition errors arise in translation units that
+include the same header.
+
+## `noexcept`
+
+All functions that do not throw must carry `noexcept`. Omitting it makes callers and
+`noexcept(...)` queries assume the function may throw, which can block optimizations.
+
+## `[[nodiscard]]`
+
+Apply `[[nodiscard]]` to every function with a non-void return type unless the function
+has a well-known side effect (e.g. `cuda::std::copy`, `cuda::std::fill`). When in doubt,
+annotate.
+
+## `_CCCL_HIDE_FROM_ABI`
+
+Internal helpers that must not appear in the public ABI are annotated with
+`_CCCL_HIDE_FROM_ABI`. This attribute combines `__attribute__((visibility("hidden")))` on
+GCC/Clang with MSVC equivalents. Apply it to:
+
+- Implementation-detail free functions in anonymous namespaces or `__` prefixed helpers.
+- Static member functions of detail classes.
+
+Do not apply it to anything part of the public API.
+
+## `constexpr` variables at namespace scope
+
+All `constexpr` variables at namespace or global scope must use `inline` to avoid ODR
+violations across translation units:
+
+```cpp
+inline constexpr bool __is_constant_evaluated_v = ...;
+
+template <typename _Tp>
+inline constexpr bool is_integral_v = ...;
+```
diff --git a/.agent/skills/cccl-libcudacxx/references/tools.md b/.agent/skills/cccl-libcudacxx/references/tools.md
new file mode 100644
index 00000000000..719263280a0
--- /dev/null
+++ b/.agent/skills/cccl-libcudacxx/references/tools.md
@@ -0,0 +1,9 @@
+# Tool index — cccl-libcudacxx
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `ci/build_libcudacxx.sh` | Full-matrix libcudacxx build: host/std/arch sweep. | `cccl-build` → `references/tools.md` |
+| `ci/test_libcudacxx.sh` | Full-matrix libcudacxx test via lit + ctest; requires GPU. | `cccl-test` → `references/tools.md` |
+| `ci/util/build_and_test_targets.sh` | Targeted build+test; `--lit-precompile-tests` / `--lit-tests` flags drive libcudacxx lit runs. | `cccl-build` → `references/build_and_test_targets_usage.md` |
diff --git a/.agent/skills/cccl-pr/SKILL.md b/.agent/skills/cccl-pr/SKILL.md
index 7a39f6c8fc0..f46991513a5 100644
--- a/.agent/skills/cccl-pr/SKILL.md
+++ b/.agent/skills/cccl-pr/SKILL.md
@@ -1,25 +1,21 @@
 ---
-name: cccl-pr
-description: "Manage CCCL pull requests — open a new draft PR after commits land, edit/comment on an existing PR (title, body, draft↔ready, comments), or push + post `/ok to test` to trigger CI. Detects fork-vs-upstream remote, opens drafts via `gh pr create --draft --repo NVIDIA/cccl`, dispatches `cccl-ok-to-test` for SHA-verified CI triggers. Use when pushing a branch and opening a PR, editing an existing PR's title/body, toggling draft/ready, commenting, or triggering CI. Trigger phrases: \"open a PR\", \"push and PR\", \"update PR description\", \"mark PR ready\", \"comment on the PR\", \"trigger CI on PR\". For commits, route to `cccl-commit` first."
+description: "CCCL pull request lifecycle — open a draft PR, edit title/body/state, comment, push new commits, and trigger CI via SHA-verified `/ok to test`. Refuses on `main`; never force-pushes. Triggers: \"open a PR\", \"push and PR\", \"update PR description\", \"mark PR ready\", \"trigger CI on PR\"."
 ---
 
 # cccl-pr
 
-CCCL PR lifecycle. Route every user-facing question through `cccl-clarify`. Refuses on `main`. Never force-pushes;
-never deletes branches; never closes/merges PRs.
+CCCL PR lifecycle. Route every user-facing question through `cccl-clarify`. Refuses on `main`. Never force-pushes; never deletes branches; never closes or merges PRs.
 
-**Merge-blocker check** — before every push or PR-open operation, detect non-empty `workflows.override` in
-`ci/matrix.yaml` and any `[skip-*]` tags on HEAD's commit message. Both block merge. Surface via `cccl-clarify`
-as a reminder — typically fine for in-progress work, but must be reset before final merge.
+**Merge-blocker check** — before every push or PR-open, detect non-empty `workflows.override` in `ci/matrix.yaml` and `[skip-*]` tags on HEAD's commit message. Both block merge. Surface via `cccl-clarify` — must be reset before final merge.
 
 ## Step 1 — Resolve mode
 
-`cccl-clarify` (or infer from phrasing):
+Infer from phrasing or ask via `cccl-clarify`:
 
 - **Open new draft PR** → Phase 1.
 - **Edit existing PR** (title / body / draft↔ready / base) → Phase 2.
 - **Comment** → Phase 3.
-- **Push + `/ok to test`** → Phase 4.
+- **Push + trigger CI** → Phase 4.
 
 ## Phase 1 — Open a new draft PR
 
@@ -37,19 +33,15 @@ git remote -v
 gh pr view --json headRepositoryOwner   # if branch already has an upstream PR
 ```
 
-Fork remote present → push there. Only `origin` and it points at `NVIDIA/cccl` → user is a maintainer; confirm
-before pushing. Ambiguous → `cccl-clarify`.
+Fork remote present → push there. Only `origin` pointing at `NVIDIA/cccl` → maintainer; confirm before pushing. Ambiguous → `cccl-clarify`.
 
 ### 1.3 Push
 
-`git push -u <remote> <branch>` (mutating; expect prompt). Capture any "view PR" URL hint from the output.
+`git push -u <remote> <branch>` (mutating; expect prompt).
 
 ### 1.4 Draft title + body, open PR
 
-Seed from `git log --oneline main..HEAD`. Title ≤ 72 chars, imperative. Body: bulleted commit summary, refs to
-issues/PRs, test plan when non-trivial. `cccl-clarify` → confirm / revise / cancel.
-
-Print the generated PR description to chat and ask the user to confirm or edit. On confirm, write to `/tmp/claude/<sessionid>/pr-body.md` and run:
+Seed from `git log --oneline main..HEAD`. Title ≤ 72 chars, imperative. Body: bulleted commit summary, issue refs, test plan when non-trivial. Confirm via `cccl-clarify`. On confirm:
 
 ```
 gh pr create --draft --repo NVIDIA/cccl --base main \
@@ -62,21 +54,17 @@ Capture the new PR number from the returned URL.
 
 ### 1.5 Trigger CI
 
-`cccl-clarify` → dispatch `cccl-ok-to-test` now (recommended; drafts need `/ok to test <SHA>` to start CI). Then
-suggest `ScheduleWakeup(delaySeconds=1200)` polling on `gh pr checks <PR#>`.
+Ask via `cccl-clarify` whether to trigger CI now. Drafts need `/ok to test <SHA>` to start CI. On yes, run Phase 4 steps.
 
 ## Phase 2 — Edit an existing PR
 
-Resolve PR# from current branch (`gh pr view --json number`) or user input. `cccl-clarify`:
+Resolve PR# from current branch (`gh pr view --json number`) or user input. One approval per operation, never bundled.
 
 - **Edit title** — draft, confirm, `gh pr edit <PR#> --title "<new>"`.
-- **Edit body** — read current via `gh pr view <PR#> --json body`, draft, confirm,
-  `gh pr edit <PR#> --body-file /tmp/claude/<sessionid>/pr-body.md`.
+- **Edit body** — read current via `gh pr view <PR#> --json body`, draft, confirm, `gh pr edit <PR#> --body-file /tmp/claude/<sessionid>/pr-body.md`.
 - **Mark ready** — `gh pr ready <PR#>`.
 - **Mark draft** — `gh pr ready <PR#> --undo`.
-- **Change base** — `gh pr edit <PR#> --base <new-base>`. Rare.
-
-All mutating; one approval per use, never bundled.
+- **Change base** — `gh pr edit <PR#> --base <new-base>`.
 
 ## Phase 3 — Comment
 
@@ -86,19 +74,51 @@ Resolve PR#. Draft body, confirm via `cccl-clarify`, then:
 gh pr comment <PR#> --repo NVIDIA/cccl --body "<comment>"
 ```
 
-For `/ok to test <SHA>` specifically, use Phase 4 — the `cccl-ok-to-test` agent owns the SHA gate.
+For `/ok to test <SHA>` specifically, use Phase 4 — it owns the SHA gate.
+
+## Phase 4 — Trigger CI
+
+For a PR whose branch has local commits not yet tested.
+
+### 4.1 Push (if needed)
+
+`git push <remote> <branch>` (mutating; expect prompt).
+
+Never use `--force` or `+<ref>` unless the user explicitly requests it after seeing the risk.
+
+### 4.2 SHA verification gate
+
+1. `git rev-parse HEAD` → `LOCAL_SHA`.
+2. `gh pr view <PR#> --repo NVIDIA/cccl --json headRefOid,isDraft,headRefName` → `REMOTE_SHA`, `isDraft`, `headRefName`.
+3. `headRefName` differs from the local branch (`git rev-parse --abbrev-ref HEAD`) → abort, showing both values.
+4. `LOCAL_SHA != REMOTE_SHA` → abort:
+
+```
+ERROR: local HEAD does not match remote PR head.
+  local:   <LOCAL_SHA>
+  remote:  <REMOTE_SHA>
+Likely: unpushed commits, or a concurrent push.
+Aborting without posting /ok to test.
+```
+
+### 4.3 Post comment
+
+```
+gh pr comment <PR#> --repo NVIDIA/cccl --body "/ok to test <LOCAL_SHA>"
+```
+
+### 4.4 Poll reminder
 
-## Phase 4 — Push + `/ok to test`
+Suggest `ScheduleWakeup(delaySeconds=1200)` polling on `gh pr checks <PR#>`.
 
-For an existing PR whose branch has new local commits.
+## Good-enough criterion
 
-1. `git push <remote> <branch>` (never force unless *explicitly* told by the user).
-2. Dispatch the `cccl-ok-to-test` agent. It owns the SHA verification, the comment, and the polling reminder.
+PR is open, branch is pushed, CI is running (or the user chose to skip triggering CI).
 
 ## Hard prohibitions
 
-- Never force-push (no `--force`, no `+<ref>`).
-- Never `gh pr close` / `gh pr merge` — out of scope.
-- Never bypass the `cccl-ok-to-test` SHA gate by posting `/ok to test` directly.
+- Never force-push (`--force`, `+<ref>`).
+- Never `gh pr close` or `gh pr merge` — out of scope.
+- Never post `/ok to test` without completing the SHA verification gate.
 - Never edit on `main`.
-- Never bundle multiple mutating ops into one approval.
+- Never bundle multiple mutating operations into one approval.
diff --git a/.agent/skills/cccl-precommit/SKILL.md b/.agent/skills/cccl-precommit/SKILL.md
new file mode 100644
index 00000000000..73bde12673e
--- /dev/null
+++ b/.agent/skills/cccl-precommit/SKILL.md
@@ -0,0 +1,102 @@
+---
+description: |
+  CCCL's pre-commit hook suite — hooks configured, what each enforces,
+  how to install and run locally, the auto-fix-and-restage pattern,
+  and how CI enforces the suite via pre-commit.ci.
+  Triggers: "run pre-commit", "format code", "lint cccl", "what linters
+  run", "clang-format failed", "pre-commit hook failed", "fix formatting".
+---
+
+# Pre-commit
+
+Reference and orientation for CCCL's pre-commit setup. Configuration lives
+in `.pre-commit-config.yaml`; tool settings (ruff, codespell, mypy) live in
+the root `pyproject.toml`; CMake formatter settings are in `.gersemirc`.
+
+## Hook inventory
+
+| Hook                            | Tool                                  | What it checks / fixes                                                                                                                           |
+|---------------------------------|---------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
+| `end-of-file-fixer`             | pre-commit/pre-commit-hooks           | Ensures files end with a newline                                                                                                               |
+| `mixed-line-ending`             | pre-commit/pre-commit-hooks           | Normalises CRLF → LF                                                                                                                           |
+| `trailing-whitespace`           | pre-commit/pre-commit-hooks           | Strips trailing whitespace (non-C/C++/CUDA; those go to clang-format)                                                                         |
+| `check-json`                    | pre-commit/pre-commit-hooks           | JSON parse check                                                                                                                               |
+| `check-toml`                    | pre-commit/pre-commit-hooks           | TOML parse check                                                                                                                               |
+| `pretty-format-json`            | pre-commit/pre-commit-hooks           | Auto-formats JSON: 2-space indent, stable key order                                                                                            |
+| `check-symlinks`                | pre-commit/pre-commit-hooks           | Detects broken symlinks                                                                                                                        |
+| `check-executables-have-shebangs` | pre-commit/pre-commit-hooks           | Executables must have a shebang                                                                                                                |
+| `check-merge-conflict`          | pre-commit/pre-commit-hooks           | Rejects leftover conflict markers                                                                                                              |
+| `check-yaml`                    | pre-commit/pre-commit-hooks           | YAML parse check                                                                                                                               |
+| `shellcheck`                    | shellcheck-py                         | Shell script linter — excludes `libcudacxx/cmake/config.guess`                                                                                 |
+| `clang-format`                  | mirrors-clang-format v20              | Formats `.c/.cpp/.cu/.cuh/.cxx/.h/.hpp/.inl/.mm` and header-only files under `libcudacxx/include/`; uses `.clang-format` (LLVM-based, 120-col limit) |
+| `ruff`                          | astral-sh/ruff-pre-commit             | Python linter with auto-fix                                                                                                                    |
+| `ruff-format`                   | astral-sh/ruff-pre-commit             | Python formatter                                                                                                                               |
+| `gersemi`                       | BlankSpruce/gersemi                   | CMake formatter; 80-col, 2-space indent; custom definitions from `cmake/` and `lib/cmake/thrust/`; extensions in `.gersemi/ext/`              |
+| `codespell`                     | codespell-project/codespell           | Spell-check; config in `pyproject.toml [tool.codespell]`; ignore list in `.codespell-ignore.txt`                                              |
+| `mypy`                          | pre-commit/mirrors-mypy               | Type-checks `python/cuda_cccl/cuda/compute/` against `python/cuda_cccl/pyproject.toml`; does not run per-file (pass_filenames: false)       |
+| `check-shebang`                 | local (`ci/util/pre-commit/check_shebang.py`) | Enforces `#!/usr/bin/env <interp>` form; auto-fixes absolute shebang paths                                                                      |
+
+## Installing locally
+
+```
+pip install pre-commit
+pre-commit install        # installs the git hook
+```
+
+After installation, the suite runs automatically on every `git commit` against
+staged files only.
+
+## Running manually
+
+Against specific files before committing (plain `pre-commit run` checks only the staged files):
+
+```
+pre-commit run --files <file1> <file2> ...
+```
+
+Against the entire tree (slow; use for first-time setup or bulk fixes):
+
+```
+pre-commit run --all-files
+```
+
+## Auto-fix-and-restage
+
+Several hooks modify files in place (clang-format, ruff-format, gersemi,
+end-of-file-fixer, pretty-format-json, check-shebang). When a hook rewrites
+a file, pre-commit exits non-zero even though the fix was applied.
+
+Pattern:
+
+1. Run pre-commit — it exits non-zero and fixes files.
+2. Review the diffs.
+3. `git add` the fixed files.
+4. Re-run pre-commit (or `git commit` again) — the hooks pass.
+
+Do not skip this review step. Hook auto-fixes occasionally over-correct edge
+cases (e.g. clang-format rewrites around CCCL macros; gersemi on hand-formatted
+CMake). Inspect the diff before staging.
+
+## CI enforcement
+
+The `.pre-commit-config.yaml` includes a `ci:` block for
+[pre-commit.ci](https://pre-commit.ci). This service runs the full hook suite
+on every push to a pull-request branch. `autofix_prs: false` means it reports
+failures but does not push auto-fix commits to the PR branch. The service
+proposes hook revision updates on a quarterly schedule.
+
+Linter failures on pre-commit.ci block PR merges. Fix locally and push — the
+service re-runs on the next push.
+
+## Skipping a hook
+
+Only when absolutely necessary (e.g., a known false positive in generated code):
+
+```
+SKIP=<hook-id> git commit ...
+# or: git commit --no-verify  (skips all hooks — use sparingly)
+```
+
+Prefer `# noqa`, `# type: ignore`, or a codespell ignore-words entry over
+blanket skips. Persistent skips should be encoded in `.pre-commit-config.yaml`
+via `exclude:` or `args:` at the hook level.
diff --git a/.agent/skills/cccl-python/SKILL.md b/.agent/skills/cccl-python/SKILL.md
index 56f65172d5c..48547b4d43c 100644
--- a/.agent/skills/cccl-python/SKILL.md
+++ b/.agent/skills/cccl-python/SKILL.md
@@ -1,12 +1,10 @@
 ---
-name: cccl-python
-description: "CCCL's Python packages (`cuda-cccl`): installation, module layout, build/test scripts, test organization. Use when the user works on the Python bindings, builds/tests Python components, or asks about the `cuda.compute` / `cuda.coop` / `cuda.cccl.headers` modules. Trigger phrases: \"cccl python\", \"cuda.compute\", \"cuda.coop\", \"cuda-cccl package\", \"build the python bindings\", \"test python\"."
+description: "CCCL's Python package (`cuda-cccl`) under `python/cuda_cccl/`: modules, build/test CI scripts, install extras, layout. Triggers: \"cccl python\", \"cuda.compute\", \"cuda.coop\", \"cuda-cccl package\", \"build the python bindings\", \"test python\"."
 ---
 
 # cccl-python
 
-Python components live under `python/cuda_cccl/`. Build/test scripts take `-py-version` instead of compiler flags.
-Supported: Python 3.10 – 3.13.
+Python components live under `python/cuda_cccl/`. Requires Python 3.10–3.13, CTK 12.x or 13.x, GPU CC 6.0+.
 
 ## Modules
 
@@ -17,23 +15,20 @@ Supported: Python 3.10 – 3.13.
 ## Install from source
 
 ```
-pip install -e python/cuda_cccl[test-cu13]   # or [test-cu12] for CTK 12.X
+pip install -e python/cuda_cccl[test-cu12]   # or [test-cu13] for CTK 13.x
 ```
 
-Requires CTK 12.x or 13.x, NVIDIA GPU CC 6.0+. Base deps: `numba>=0.60.0`, `numpy`, `cuda-pathfinder>=1.2.3`,
-`cuda-core`, `typing_extensions`. CUDA extras add `cuda-bindings`, `cuda-toolkit`, `numba-cuda`.
+## Build / test scripts
 
-## Build / test
+Scripts under `ci/`; pass `-py-version 3.10` (or 3.11–3.13).
 
-```
-./ci/build_cuda_cccl_python.sh        -py-version 3.10
-./ci/test_cuda_compute_python.sh      -py-version 3.10
-./ci/test_cuda_coop_python.sh         -py-version 3.10
-./ci/test_cuda_cccl_headers_python.sh -py-version 3.10
-./ci/test_cuda_cccl_examples_python.sh -py-version 3.10
-```
-
-Build script needs no GPU; test scripts do.
+| Script                                 | GPU required |
+|----------------------------------------|--------------|
+| `ci/build_cuda_cccl_python.sh`         | no           |
+| `ci/test_cuda_compute_python.sh`       | yes          |
+| `ci/test_cuda_coop_python.sh`          | yes          |
+| `ci/test_cuda_cccl_headers_python.sh`  | yes          |
+| `ci/test_cuda_cccl_examples_python.sh` | yes          |
 
 ## Layout
 
@@ -44,3 +39,8 @@ python/cuda_cccl/
 ├── benchmarks/
 └── pyproject.toml
 ```
+
+## Additional resources
+
+- `references/docs.md` — index of Python package documentation.
+- `references/tools.md` — build and test scripts for the Python bindings.
diff --git a/.agent/skills/cccl-python/references/docs.md b/.agent/skills/cccl-python/references/docs.md
new file mode 100644
index 00000000000..9276e9e20f0
--- /dev/null
+++ b/.agent/skills/cccl-python/references/docs.md
@@ -0,0 +1,20 @@
+# Documentation index — cccl-python
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `docs/python/index.rst` | Overview of `cuda.compute` and `cuda.coop._experimental` modules. |
+| `docs/python/setup.rst` | Installation, version selection, environment configuration. |
+| `docs/python/compute_api.rst` | Device-level parallel algorithms: reduce, scan, sort, and more. |
+| `docs/python/compute/index.rst` | Auto-generated `cuda.compute` API reference and examples (and subdirectories). |
+| `docs/python/coop.rst` | Cooperative block/warp-level primitives for Numba CUDA. |
+| `docs/python/coop_api.rst` | Full cooperative primitives API documentation. |
+| `docs/python/api_reference.rst` | Index of all Python module APIs. |
+| `python/cuda_cccl/README.md` | Package overview, installation, module summary, and usage examples. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/python/resources.rst` | Learning resources, tutorials, and integration guides. |
diff --git a/.agent/skills/cccl-python/references/tools.md b/.agent/skills/cccl-python/references/tools.md
new file mode 100644
index 00000000000..7c3c7c2c57e
--- /dev/null
+++ b/.agent/skills/cccl-python/references/tools.md
@@ -0,0 +1,18 @@
+# Tool index — cccl-python
+
+## Owned (canonical reference lives here)
+
+| Tool | Purpose | Detail |
+|------|---------|--------|
+| `ci/util/python/common_arg_parser.sh` | Shared argument parsing utilities sourced by Python CI scripts. Provides common flags (`--cuda-version`, `--python-version`, etc.) used across `test_cuda_*.sh` scripts. | sourced library, not invoked directly |
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `ci/build_cuda_cccl_wheel.sh` | Builds the cuda-cccl Python wheel package. | `cccl-build` → `references/tools.md` |
+| `ci/build_cuda_cccl_python.sh` | Builds cuda.cccl in-tree (dev install mode). | `cccl-build` → `references/tools.md` |
+| `ci/test_cuda_compute_python.sh` | Tests cuda.compute Python bindings. | `cccl-test` → `references/tools.md` |
+| `ci/test_cuda_coop_python.sh` | Tests cuda.coop Python bindings. | `cccl-test` → `references/tools.md` |
+| `ci/test_cuda_cccl_headers_python.sh` | Tests cuda.cccl Python C++ header generation and compilation. | `cccl-test` → `references/tools.md` |
+| `ci/test_cuda_cccl_examples_python.sh` | Tests cuda.cccl Python example scripts. | `cccl-test` → `references/tools.md` |
diff --git a/.agent/skills/cccl-resplit-branch/SKILL.md b/.agent/skills/cccl-resplit-branch/SKILL.md
index cbe2d216e30..b6d6b8ffc35 100644
--- a/.agent/skills/cccl-resplit-branch/SKILL.md
+++ b/.agent/skills/cccl-resplit-branch/SKILL.md
@@ -1,32 +1,34 @@
 ---
-name: cccl-resplit-branch
-description: "Rebase a CCCL feature branch onto `main` and resplit its commit history into a clean series, using the same interactive chunk-walkthrough as `cccl-commit`. Backs up the original branch tip, rebases (resolving conflicts), collapses commits to a single working-tree diff via `git reset --mixed`, then hands off to `cccl-commit`'s split / interactive / commit pipeline. Use when a branch has accumulated messy / squashable / out-of-order commits and needs a clean series before opening or refreshing a PR. Trigger phrases: \"resplit this branch\", \"clean up these commits\", \"rebase and resplit\", \"reorganize the commits\", \"squash and resplit\", \"fix up commit history\". For first-time commits on a fresh branch, use `cccl-commit`."
+description: |
+  Rebase a CCCL feature branch onto `main` and resplit its commit history
+  into a clean series. Backs up the original tip, rebases (resolving
+  conflicts via `cccl-clarify`), collapses to a working-tree diff, then
+  hands off to `cccl-commit`. For first-time commits on a fresh branch,
+  use `cccl-commit` directly.
+  Triggers: "resplit this branch", "clean up these commits", "rebase and resplit", "fix up commit history".
 ---
 
 # cccl-resplit-branch
 
-Rebase onto `main`, then collapse the branch's commits into a working-tree diff and replay them as a clean series
-via `cccl-commit`'s flow. Route every user-facing question through `cccl-clarify`. Refuses on `main`. Never
-force-pushes — that's `cccl-pr` Phase 4 with explicit user approval.
+Rebase onto `main`, collapse the branch's commits into an unstaged working-tree diff, and replay them as a clean series via `cccl-commit`. Refuses on `main`. Never force-pushes — that is `cccl-pr`'s responsibility.
 
 ## Step 1 — Pre-flight
 
-- Refuse on `main` (`git rev-parse --git-dir` vs `--git-common-dir`).
-- Working tree must be clean: `git status --porcelain` empty. Dirty → route to `cccl-commit` first.
-- Scratch: `mkdir -p /tmp/claude/<sessionid>`.
-- `git log --oneline main..HEAD > /tmp/claude/<sessionid>/original-commits.txt`. Empty → nothing to resplit;
-  exit. Branch is already pushed with review activity → `cccl-clarify` confirms the user wants to rewrite
-  published history (force-push will come later via `cccl-pr` Phase 4).
+- Refuse on `main`.
+- Working tree must be clean (`git status --porcelain` empty). Dirty → route to `cccl-commit` first.
+- `mkdir -p /tmp/claude/<sessionid>/resplit`.
+- `git log --oneline main..HEAD > /tmp/claude/<sessionid>/resplit/original-commits.txt`. Empty → nothing to resplit; exit.
+- Branch is already published with review activity → `cccl-clarify` confirms the user wants to rewrite history (force-push follows later via `cccl-pr`).
 
-## Step 2 — Backup the tip
+## Step 2 — Backup
 
-`cccl-clarify` confirms the backup ref name (default `refs/backup/<branch>-<YYYYMMDD-HHMMSS>`). Then:
+Confirm backup ref name via `cccl-clarify` (default `refs/backup/<branch>-<YYYYMMDD-HHMMSS>`), then:
 
 ```
 git update-ref refs/backup/<branch>-<timestamp> HEAD
 ```
 
-Surface the backup ref in every later confirmation prompt — recovery is `git reset --hard <ref>`.
+Recovery is possible at any point before final commits land — see [Recovery](#recovery).
 
 ## Step 3 — Rebase onto `main`
 
@@ -35,77 +37,56 @@ git fetch origin main
 git rebase origin/main
 ```
 
-On conflict, for each conflicted file route through `cccl-clarify`:
+On conflict, route each file through `cccl-clarify`:
 
-- **Resolve manually** — read file, present conflict markers verbatim in chat, suggest resolution, user picks.
-- **Take ours** — `git checkout --ours <file>`.
-- **Take theirs** — `git checkout --theirs <file>`.
-- **Skip commit** — `git rebase --skip` (loses content; only for already-redone work).
-- **Abort** — `git rebase --abort`; surface backup ref; exit.
+| Choice            | Command                                                                    |
+|-------------------|----------------------------------------------------------------------------|
+| Resolve manually  | Present conflict markers verbatim; user picks resolution; `git add <file>` |
+| Take ours         | `git checkout --ours <file>` then `git add <file>`                         |
+| Take theirs       | `git checkout --theirs <file>` then `git add <file>`                       |
+| Skip commit       | `git rebase --skip` (loses content; only for already-redone work)          |
+| Abort             | `git rebase --abort` then stop; backup ref still intact                    |
 
-After resolution: `git add <file>` per-file (never bulk-stage), then `git rebase --continue`.
+After each resolution: `git add <file>` per-file (never bulk-stage), then `git rebase --continue`.
 
-### 3.1 Verify
+Verify: `git diff main..HEAD --stat > /tmp/claude/<sessionid>/resplit/rebased-stat.txt`. Compare the touched files against `original-commits.txt`; material mismatch → `cccl-clarify` (continue / inspect / abort).
 
-```
-git diff main..HEAD --stat > /tmp/claude/<sessionid>/rebased-diff-stat.txt
-```
-
-Compare touched-file set to the pre-rebase commit list. Material mismatch → `cccl-clarify` (continue / inspect /
-abort to backup).
-
-## Step 4 — Collapse to working tree
+## Step 4 — Collapse
 
 ```
 git reset --mixed main
 ```
 
-`--mixed` keeps every change in the working tree, unstaged — the starting state `cccl-commit` expects. **Never
-`--hard`** (would discard the work). Mutating; expect prompt; surface the backup ref in the prompt.
+**Never `--hard`** — `--mixed` keeps all changes unstaged, which is the state `cccl-commit` expects. This is irreversible without the backup ref; surface the ref in the confirmation prompt.
 
-Verify: `git diff --stat` must match the rebased diff stat from Step 3.1. Material divergence → STOP.
+Verify: `git diff --stat` must match `rebased-stat.txt`. Divergence → STOP.
 
 ## Step 5 — Hand off to `cccl-commit`
 
-Run `cccl-commit` from Step 1 onward. Splitting and Committing are implicit (a resplit means at least one new
-commit), but offer Interactive (strongly recommended — catches drift the original series hid) and Test gate via
-`cccl-clarify`.
+Invoke `cccl-commit` from Step 1. Seed the chunk planner from `original-commits.txt`, using the original commit subjects as draft message starters. The resplit fixes structure; it does not invent it.
 
-Seed the chunk planner from the original commit series (read `original-commits.txt`) — the resplit's job is to
-*fix* problems, not invent unrelated structure. Use original commit subjects as starting drafts for the new
-messages.
+Offer the Interactive walkthrough (strongly recommended — catches drift the original series hid) and the test gate via `cccl-clarify` before committing.
 
-## Step 6 — Final tree check
-
-After the last commit:
+## Step 6 — Final check
 
 ```
 git diff HEAD refs/backup/<branch>-<timestamp> --stat
 ```
 
-Non-empty → the new branch diverges from the original. Present the delta via `cccl-clarify`:
-
-- **Expected** (user reverted / edited chunks during walkthrough) — accept.
-- **Unexpected** — investigate, or `git reset --hard <backup>` to abort.
-
-Report final tip SHA, commit list, backup ref location, and a force-push reminder if the branch was published.
+Non-empty → present the delta via `cccl-clarify`: expected (user edited chunks) or unexpected (investigate, or reset to backup). Report final tip SHA, commit list, backup ref, and a force-push reminder if the branch was published.
 
 ## Recovery
 
-At any time before commits start landing: `git reset --hard refs/backup/<branch>-<timestamp>` restores the
-original tip. After commits land: same command, but the new series is lost; surface this trade-off when the user
-asks to abort late.
+`git reset --hard refs/backup/<branch>-<timestamp>` restores the original tip at any time before the new commits land. After commits land, the same command discards the new series — surface this trade-off explicitly if the user asks to abort late.
 
 ## Hard prohibitions
 
-- Never `git reset --hard` outside an explicit user-confirmed abort.
-- Never force-push — `cccl-pr` Phase 4 owns that with its own approval.
+- Never `git reset --hard` without an explicit, user-confirmed abort.
+- Never force-push — `cccl-pr` Phase 4 owns that.
 - Never delete a backup ref without per-ref user approval.
 - Never `--no-verify`.
-- Never co-author / tool-attribution footers.
-- Never `git rebase --abort` autonomously — only on explicit user choice.
+- Never `git rebase --abort` autonomously — only on explicit user choice via `cccl-clarify`.
 
-## Handoff to `cccl-pr`
+## Handoff
 
-If the branch was published, the resplit requires a force-push. Route to `cccl-pr` Phase 4 — and note its
-current force-push prohibition. Until that's opted-in, the user runs `git push --force-with-lease` by hand.
+If the branch was published, a force-push is required. Route to `cccl-pr` Phase 4 and note its force-push prohibition: unless the user explicitly opts in there, they run `git push --force-with-lease` by hand.
diff --git a/.agent/skills/cccl-sass-diff/SKILL.md b/.agent/skills/cccl-sass-diff/SKILL.md
index 1f4d8944ec6..62ff9af8ced 100644
--- a/.agent/skills/cccl-sass-diff/SKILL.md
+++ b/.agent/skills/cccl-sass-diff/SKILL.md
@@ -1,6 +1,5 @@
 ---
-name: cccl-sass-diff
-description: "Compare CUDA SASS or PTX between two CCCL builds (commits, branches, working-copy vs HEAD) to detect non-trivial codegen changes while filtering noise from addresses, symbols, metadata, and pure register renaming. Use when the user asks to check for SASS changes, audit ABI/codegen impact of a change, or compare PTX. Trigger phrases: \"check for SASS changes\", \"compare SASS\", \"any codegen impact\", \"PTX diff\"."
+description: "Compare CUDA SASS or PTX between two CCCL builds (commits, branches, working-copy vs HEAD) to detect non-trivial codegen changes — filters addresses, symbols, metadata, and pure register renaming. Triggers: \"check for SASS changes\", \"compare SASS\", \"any codegen impact\", \"PTX diff\"."
 ---
 
 # cccl-sass-diff
diff --git a/.agent/skills/cccl-test/SKILL.md b/.agent/skills/cccl-test/SKILL.md
new file mode 100644
index 00000000000..2f175bfce1c
--- /dev/null
+++ b/.agent/skills/cccl-test/SKILL.md
@@ -0,0 +1,89 @@
+---
+description: |
+  CCCL C++ test paths — fast iteration first, full matrix when needed.
+  Covers `ci/util/build_and_test_targets.sh` for targeted CTest/lit runs
+  and `ci/test_*.sh` for full host/std/arch matrix tests. GPU required for test scripts.
+  Triggers: "run test Y", "how do I run the cub tests", "test libcudacxx",
+  "full matrix test", "run just one test".
+---
+
+# cccl-test
+
+Two test paths. Start with the targeted driver for inner-loop iteration; reach for the full-matrix scripts
+when you need a complete host/std/arch sweep.
+
+## Fast iteration — `ci/util/build_and_test_targets.sh`
+
+Single wrapper around `cmake`, `ninja`, `ctest`, and `lit` for one preset at a time — the same driver
+`cccl-build` uses, with `--ctest-targets` and `--lit-tests` / `--lit-precompile-tests` added for the test
+runners:
+
+- `ctest` — runs the regexes in `--ctest-targets` against the built CTest registry (CUB, Thrust, cudax, C Parallel)
+- `lit` — runs `--lit-tests` (and pre-compiles `--lit-precompile-tests`) for libcudacxx
+
+Run from the repo root, inside the devcontainer. GPU required for execution.
+
+**CTest targets (CUB, Thrust, cudax, C Parallel):**
+
+```
+ci/util/build_and_test_targets.sh \
+  --preset <name> \
+  --build-targets "<target>" \
+  --ctest-targets "<target>"
+```
+
+`--ctest-targets` takes a space-separated list of CTest `-R` regexes. Omit `--build-targets` if already built.
+
+**lit targets (libcudacxx):**
+
+```
+ci/util/build_and_test_targets.sh \
+  --preset libcudacxx \
+  --lit-precompile-tests "std/algorithms/alg.nonmodifying/alg.any_of/any_of.pass.cpp" \
+  --lit-tests           "std/algorithms/alg.nonmodifying/alg.any_of/any_of.pass.cpp"
+```
+
+Paths are relative to `libcudacxx/test/libcudacxx/`. Avoid `--build-targets "libcudacxx.cpp20.precompile.lit"` — it precompiles the entire suite.
+
+Common preset/target pairs:
+
+| Project    | Preset(s)                      | Target example                   |
+|------------|--------------------------------|----------------------------------|
+| CUB        | `cub-cpp17`, `cub-cpp20`       | `cub.cpp20.test.iterator`        |
+| Thrust     | `thrust-cpp17`, `thrust-cpp20` | `thrust.cpp20.test.reduce`       |
+| cudax      | `cudax`                        | `cudax.cpp20.test.async_buffer`  |
+| C Parallel | `cccl-c-parallel`              | `cccl.c.test.reduce`             |
+
+Also available: `--custom-test-cmd "<cmd>"` for an arbitrary command after CTest.
+
+## Full matrix — `ci/test_*.sh`
+
+Per-project scripts that test across a full host/std/arch sweep. GPU required.
+
+```
+./ci/test_<project>.sh  -cxx <compiler>  -std <std>  -arch "<arch-list>"
+```
+
+| Project    | Script            | Stds    |
+|------------|-------------------|---------|
+| CUB        | `test_cub`        | 17, 20  |
+| Thrust     | `test_thrust`     | 17, 20  |
+| libcudacxx | `test_libcudacxx` | 17, 20  |
+| cudax      | `test_cudax`      | 20 only |
+
+Test scripts build implicitly if the build tree is missing. CTest preset form also works:
+`ctest --preset=cub-cpp17`.
+
+Compute-sanitizer variants: append `-compute-sanitizer-{memcheck,racecheck,initcheck,synccheck}`.
+Not all projects support all tools — check `--help`.
+
+Full test runs: 30+ min. Never cancel mid-run.
+
+For architecture flag syntax, see `cccl-build` → `references/arch-flag.md`.
+
+## Additional resources
+
+- `references/docs.md` — index of CCCL test documentation.
+- `references/tools.md` — all test scripts with purpose and cross-references.
+- `cccl-build` → `references/build_and_test_targets_usage.md` — full `build_and_test_targets.sh` interface (shared with build).
+- See `cccl-build` for building before you test.
diff --git a/.agent/skills/cccl-test/references/docs.md b/.agent/skills/cccl-test/references/docs.md
new file mode 100644
index 00000000000..fb5b5182d66
--- /dev/null
+++ b/.agent/skills/cccl-test/references/docs.md
@@ -0,0 +1,19 @@
+# Documentation index — cccl-test
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cccl/development/build_and_bisect_tools.rst` | Build and test tool reference; preset usage, CTest targets, lit test paths. |
+| `docs/cccl/development/testing.rst` | CCCL testing philosophy, parameterization (`%PARAM%`), compute-sanitizer variants. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cub/test_overview.rst` | CUB test infrastructure: test naming, parameterization, coverage strategy. |
+
+## See also
+
+- `cccl-build` → `references/docs.md` for build-phase documentation (build before test).
+- `cccl_detail-test-params` → `references/` for `%PARAM%` test parameterization details.
diff --git a/.agent/skills/cccl-test/references/tools.md b/.agent/skills/cccl-test/references/tools.md
new file mode 100644
index 00000000000..b202c7e92f6
--- /dev/null
+++ b/.agent/skills/cccl-test/references/tools.md
@@ -0,0 +1,33 @@
+# Tool index — cccl-test
+
+## Used (canonical reference lives in cccl-build)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `ci/util/build_and_test_targets.sh` | Targeted build+test driver; `--ctest-targets` and `--lit-tests` flags run the test phase. | `cccl-build` → `references/build_and_test_targets_usage.md` |
+| `ci/build_common.sh` | Sourced by full-matrix test scripts for option parsing, compiler validation, `test_preset` helper. | `cccl-build` → `references/build_common.sh_usage.md` |
+
+## Owned (canonical reference lives here)
+
+| Tool | Purpose | Detail |
+|------|---------|--------|
+| `ci/test_cub.sh` | Full-matrix CUB test: host/std/arch sweep; requires GPU. | inherits options from `build_common.sh_usage.md` |
+| `ci/test_thrust.sh` | Full-matrix Thrust test. | inherits options from `build_common.sh_usage.md` |
+| `ci/test_libcudacxx.sh` | Full-matrix libcudacxx test via lit + ctest. | inherits options from `build_common.sh_usage.md` |
+| `ci/test_cudax.sh` | Full-matrix cudax test (C++20 only). | inherits options from `build_common.sh_usage.md` |
+| `ci/test_cccl_c_parallel.sh` | C Parallel Library test. | inherits options from `build_common.sh_usage.md` |
+| `ci/test_cccl_c_parallel_hostjit.sh` | C Parallel hostjit variant test. | inherits options from `build_common.sh_usage.md` |
+| `ci/test_cccl_c_stf.sh` | CCCL C STF test. | inherits options from `build_common.sh_usage.md` |
+| `ci/test_packaging.sh` | CPM-based downstream consumption test (no GPU required). | inherits options from `build_common.sh_usage.md` |
+| `ci/test_nvbench_helper.sh` | Helper for nvbench benchmark execution during test phase. | inherits options from `build_common.sh_usage.md` |
+| `ci/test_python_common.sh` | Shared Python test utilities; sourced by `test_cuda_*.sh` scripts. | sourced library, not invoked directly |
+| `ci/test_cuda_compute_python.sh` | cuda.compute Python bindings test. | see `cccl-python` → `references/tools.md` |
+| `ci/test_cuda_coop_python.sh` | cuda.cooperative Python bindings test. | see `cccl-python` → `references/tools.md` |
+| `ci/test_cuda_cccl_headers_python.sh` | cuda.cccl Python C++ header generation/compilation test. | see `cccl-python` → `references/tools.md` |
+| `ci/test_cuda_cccl_examples_python.sh` | cuda.cccl Python example script test. | see `cccl-python` → `references/tools.md` |
+
+## Used (canonical reference lives in cccl-devcontainer)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `.devcontainer/launch.sh` | Spin up the devcontainer for GPU test execution. | `cccl-devcontainer` → `references/tools.md` |
diff --git a/.agent/skills/cccl-thrust/SKILL.md b/.agent/skills/cccl-thrust/SKILL.md
new file mode 100644
index 00000000000..ab301bfa094
--- /dev/null
+++ b/.agent/skills/cccl-thrust/SKILL.md
@@ -0,0 +1,103 @@
+---
+description: |
+  Tour and orientation for the Thrust subdirectory — what the library is, how the
+  include tree is laid out, the backend abstraction model, execution policies,
+  relationship to CUB, and test suite structure.
+  Triggers: "what is thrust", "thrust overview", "thrust algorithms",
+  "thrust execution policies", "thrust backends".
+---
+
+# Thrust
+
+Thrust is CCCL's high-level parallel algorithms library. It provides an STL-like
+interface (`thrust::sort`, `thrust::reduce`, `thrust::transform`, …) that runs on
+multiple backends (CUDA GPU, OpenMP, TBB, and serial CPU), selected at compile
+time by the execution policy passed at the call site.
+
+## Directory layout
+
+| Path                      | Contents                                                  |
+|---------------------------|-----------------------------------------------------------|
+| `thrust/thrust/`          | Public headers — one file per algorithm                  |
+| `thrust/thrust/system/`   | Backend implementations (cuda, cpp, omp, tbb)            |
+| `thrust/thrust/detail/`   | Per-algorithm `.inl` dispatch internals                  |
+| `thrust/thrust/iterator/` | Iterator adaptors (transform, zip, counting, …)          |
+| `thrust/thrust/mr/`       | Memory resource layer                                    |
+| `thrust/testing/`         | Main test suite (`.cu` files, CTest-driven)              |
+| `thrust/testing/cuda/`    | CUDA-backend-specific tests                              |
+| `thrust/examples/`        | Standalone `.cu` examples                                |
+| `thrust/cmake/`           | CMake helpers: target lists, multi-config, header testing |
+
+## Public API surface
+
+Users include flat top-level headers: `<thrust/sort.h>`, `<thrust/reduce.h>`,
+`<thrust/transform.h>`, etc. Each header's implementation body lives in
+`thrust/detail/<algorithm>.inl`, included from the top-level header. There is no
+`thrust/thrust.h` umbrella; users include only what they use.
+
+Container types — `thrust::device_vector`, `thrust::host_vector`,
+`thrust::universal_vector` — live in their own top-level headers.
+
+## Backend abstraction
+
+Each backend occupies `thrust/thrust/system/<name>/`:
+
+| Backend | Namespace          | Description                                      |
+|---------|--------------------|--------------------------------------------------|
+| `cuda`  | `thrust::cuda_cub` | GPU execution via CUDA + CUB device primitives   |
+| `cpp`   | `thrust::cpp`      | Serial CPU execution (STL / standard algorithms) |
+| `omp`   | `thrust::omp`      | OpenMP parallel CPU                              |
+| `tbb`   | `thrust::tbb`      | Intel TBB parallel CPU                           |
+
+Each backend directory contains `execution_policy.h`, `detail/<algorithm>.h`, and
+supporting headers. Algorithm dispatch follows C++ ADL: the execution policy type
+selects the backend's overload set.
+
+## Execution policies
+
+Execution policies are the user-facing dispatch mechanism. Pass one as the first
+argument to any Thrust algorithm to select a backend:
+
+```cpp
+thrust::sort(thrust::device, v.begin(), v.end());   // CUDA backend
+thrust::sort(thrust::host,   v.begin(), v.end());   // default host backend
+thrust::sort(thrust::seq,    v.begin(), v.end());   // serial (no parallelism)
+
+// stream-scoped:
+thrust::sort(thrust::cuda::par.on(stream), v.begin(), v.end());
+```
+
+`thrust::device` and `thrust::host` resolve to the backends selected by
+`THRUST_DEVICE_SYSTEM` and `THRUST_HOST_SYSTEM` macros (defaults: CUDA and CPP).
+`thrust::cuda::par` is the concrete CUDA policy and supports `.on(stream)` for
+stream association.
+
+See `references/execution-policies.md` for the full policy type hierarchy and
+stream/allocator extensions.
+
+## CUB relationship
+
+The CUDA backend (`thrust/system/cuda/detail/`) delegates device-side work to CUB
+device-scope primitives. For example, `thrust::sort` calls `cub::DeviceRadixSort`
+or `cub::DeviceMergeSort`; `thrust::reduce` calls `cub::DeviceReduce`. Thrust owns
+the policy dispatch and host-side coordination; CUB owns the GPU kernel implementation.
+This means CUDA-backend performance is directly tied to the corresponding CUB primitive.
+
+## Test suite
+
+`thrust/testing/` contains one `.cu` file per algorithm or feature. Tests are
+CUDA-compiled and run via CTest. Backend-specific tests live in subdirectories:
+
+- `testing/cuda/` — CUDA-backend tests (stream, CDP, memcpy flags, etc.)
+- `testing/omp/` and `testing/cpp/` — host-backend variants
+
+Catch2-based tests use the `catch2_test_*.cu` prefix. Legacy tests use a custom
+`unittest` harness also in `testing/`.
+
+Build and run with `cccl-build` / `cccl-test`.
+
+## Additional resources
+
+- `references/execution-policies.md` — policy type hierarchy, `.on(stream)`, allocator policies, custom backends
+- `references/docs.md` — index of Thrust documentation (API reference, developer overview).
+- `references/tools.md` — build and test scripts for Thrust.
diff --git a/.agent/skills/cccl-thrust/references/docs.md b/.agent/skills/cccl-thrust/references/docs.md
new file mode 100644
index 00000000000..dece7a59484
--- /dev/null
+++ b/.agent/skills/cccl-thrust/references/docs.md
@@ -0,0 +1,20 @@
+# Documentation index — cccl-thrust
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `docs/thrust/index.rst` | Thrust overview: high-level parallel algorithms, device systems, execution policies. |
+| `docs/thrust/api.rst` | Doxygen-extracted Thrust API reference and algorithm guide. |
+| `docs/thrust/api/index.rst` | Auto-generated API docs for Thrust algorithms and containers (and subdirectories). |
+| `docs/thrust/developer_overview.rst` | Internal architecture, backend systems, and development guide for contributors. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `thrust/examples/README.md` | Example projects demonstrating Thrust usage patterns. |
+
+## See also
+
+- `cccl_detail-test-params` for `%PARAM%` test parameterization internals.
diff --git a/.agent/skills/cccl-thrust/references/execution-policies.md b/.agent/skills/cccl-thrust/references/execution-policies.md
new file mode 100644
index 00000000000..299ab730f4a
--- /dev/null
+++ b/.agent/skills/cccl-thrust/references/execution-policies.md
@@ -0,0 +1,71 @@
+# Thrust Execution Policy Reference
+
+## Policy type hierarchy
+
+```
+thrust::execution_policy<Derived>
+├── thrust::host_execution_policy<Derived>      — base for host backends
+│   ├── thrust::cpp::execution_policy<>         — serial CPU
+│   ├── thrust::omp::execution_policy<>         — OpenMP CPU
+│   └── thrust::tbb::execution_policy<>         — TBB CPU
+└── thrust::device_execution_policy<Derived>    — base for device backends
+    └── thrust::cuda::execution_policy<Derived> — CUDA GPU
+        └── thrust::cuda::par_t                 — concrete CUDA policy object
+```
+
+## Built-in policy objects
+
+| Object              | Header                                        | Description                   |
+|---------------------|-----------------------------------------------|-------------------------------|
+| `thrust::seq`       | `<thrust/execution_policy.h>`                 | Serial; no parallelism        |
+| `thrust::host`      | `<thrust/execution_policy.h>`                 | Host backend (macro-selected) |
+| `thrust::device`    | `<thrust/execution_policy.h>`                 | Device backend (macro-selected) |
+| `thrust::cuda::par` | `<thrust/system/cuda/execution_policy.h>`    | CUDA backend                  |
+| `thrust::omp::par`  | `<thrust/system/omp/execution_policy.h>`     | OpenMP backend                |
+| `thrust::tbb::par`  | `<thrust/system/tbb/execution_policy.h>`     | TBB backend                   |
+
+## Stream association
+
+```cpp
+#include <thrust/system/cuda/execution_policy.h>
+
+cudaStream_t stream;
+cudaStreamCreate(&stream);
+
+// Associate all work in the call with `stream`:
+thrust::sort(thrust::cuda::par.on(stream), v.begin(), v.end());
+```
+
+`.on(stream)` returns a new policy object bound to the given CUDA stream. Algorithms
+launched with this policy enqueue their work on that stream and obey standard CUDA
+stream ordering; the call may still synchronize that stream before it returns.
+
+## Allocator-aware policies
+
+Combine a stream with a custom allocator for temporary storage:
+
+```cpp
+thrust::sort(thrust::cuda::par(my_allocator).on(stream), v.begin(), v.end());
+```
+
+Thrust uses the allocator for internal scratch buffers instead of `cudaMalloc`.
+Common use: caching allocators (e.g., one built on `thrust::mr::unsynchronized_pool_resource`)
+to avoid repeated device allocations in hot loops.
+
+## Custom backends
+
+Derive from `thrust::host_execution_policy<MyPolicy>` or
+`thrust::device_execution_policy<MyPolicy>` and provide ADL-visible overloads for
+the algorithms you specialise. Algorithms without an overload fall back to the
+parent class's backend.
+
+`thrust/examples/minimal_custom_backend.cu` — minimal working example.
+
+## Backend selection macros
+
+| Macro                    | Default                     | Effect                        |
+|--------------------------|-----------------------------|-------------------------------|
+| `THRUST_DEVICE_SYSTEM`   | `THRUST_DEVICE_SYSTEM_CUDA` | Sets `thrust::device` backend |
+| `THRUST_HOST_SYSTEM`     | `THRUST_HOST_SYSTEM_CPP`    | Sets `thrust::host` backend   |
+
+Backends: `CUDA`, `CPP`, `OMP`, `TBB`. Both macros are resolved at compile time; the backend cannot be changed at runtime.
diff --git a/.agent/skills/cccl-thrust/references/tools.md b/.agent/skills/cccl-thrust/references/tools.md
new file mode 100644
index 00000000000..6a57c2cf235
--- /dev/null
+++ b/.agent/skills/cccl-thrust/references/tools.md
@@ -0,0 +1,9 @@
+# Tool index — cccl-thrust
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `ci/build_thrust.sh` | Full-matrix Thrust build: host/std/arch sweep. | `cccl-build` → `references/tools.md` |
+| `ci/test_thrust.sh` | Full-matrix Thrust test: host/std/arch sweep; requires GPU. | `cccl-test` → `references/tools.md` |
+| `ci/util/build_and_test_targets.sh` | Targeted build+test for inner-loop iteration against a single preset. | `cccl-build` → `references/build_and_test_targets_usage.md` |
diff --git a/.agent/skills/cccl-triage-nightly/SKILL.md b/.agent/skills/cccl-triage-nightly/SKILL.md
deleted file mode 100644
index f1e2e474cba..00000000000
--- a/.agent/skills/cccl-triage-nightly/SKILL.md
+++ /dev/null
@@ -1,42 +0,0 @@
----
-name: cccl-triage-nightly
-description: "Diagnose failures in the latest scheduled CCCL nightly run on `main` in the CCCL repository. Locates the run, groups failures by toolchain/project, fetches representative logs, summarizes, presents to user, and — on approval — applies fixes against a new branch, opens a draft PR, posts `/ok to test <SHA>`. Use when the user asks to triage, diagnose, or fix nightly CI. Trigger phrases: \"triage the nightly\", \"what failed in nightly\", \"diagnose latest nightly\", \"fix nightly CI\", \"investigate nightly run\"."
-argument-hint: "[run-id-or-empty]"
----
-
-# cccl-triage-nightly
-
-Same shape as `cccl-triage-pr`, but starts from a workflow run and ends by opening a fresh draft PR.
-
-Scratch dir, single-Bash discipline, worktree safety, and `cccl-clarify` routing match `cccl-triage-pr`.
-
-## Step 1 — Locate the run
-
-User-supplied run ID wins. Otherwise:
-
-```
-gh run list --workflow=ci-workflow-nightly.yml --branch=main --limit=1 --json databaseId,conclusion,createdAt,headSha > /tmp/claude/<sessionid>/nightly_run.json
-```
-
-Capture `databaseId` and `headSha`. `conclusion: success` → stop.
-
-## Step 2 — Fetch failures
-
-Dispatch `cccl-fetch-ci-failures` with the run ID.
-
-## Steps 3–7 — Group, fetch logs, summarize, present, diagnose
-
-Identical to `cccl-triage-pr` steps 3–7.
-
-## Step 8 — Ship the fix
-
-No existing branch or PR — open a fresh one.
-
-1. **Worktree safety.** Refuse on `main`. Offer to create a new named branch via `cccl-clarify`.
-2. **Apply edits.** Per-file approval via `cccl-clarify`. Offer `gh issue create` for any deferred problems.
-3. **Override matrix + skip tags.** Dispatch `cccl-ci-overrides` with `failed_jobs:` (TSV path), `paths:` (edited
-   files), `for_workflow: nightly`. Reference the nightly run ID in the override comment; skip tags apply to the
-   LAST commit only.
-4. **Commit.** Route to `cccl-commit`.
-5. **Open PR + `/ok to test`.** Route to `cccl-pr` Phase 1. PR body should reference the nightly run + per-cluster
-   diagnosis. Multiple PRs → run Phase 1 once per branch, framed via `cccl-clarify`.
diff --git a/.agent/skills/cccl-triage-pr/SKILL.md b/.agent/skills/cccl-triage-pr/SKILL.md
deleted file mode 100644
index 89ba4bde38a..00000000000
--- a/.agent/skills/cccl-triage-pr/SKILL.md
+++ /dev/null
@@ -1,85 +0,0 @@
----
-name: cccl-triage-pr
-description: "Diagnose and (optionally) fix CI failures on the current branch's open PR in the CCCL repository. Resolves the PR from the current branch, groups failed checks by likely root cause, pulls representative logs, summarizes them, presents findings, and — on user approval — applies fixes, adds override matrix + skip tags, commits, pushes, posts `/ok to test <SHA>`. Use when the user wants to investigate or fix CI failures on a PR. Trigger phrases: \"diagnose the PR\", \"fix CI on this PR\", \"what's failing in CI\", \"investigate this PR's CI\"."
-argument-hint: "[PR-number]"
----
-
-# cccl-triage-pr
-
-Route user-question moments through `cccl-clarify`. Create the scratch dir once: `mkdir -p /tmp/claude/<sessionid>`.
-
-## Step 1 — Resolve PR
-
-User-supplied PR# wins. Otherwise:
-
-```
-gh pr view --json number,title,state,url,headRefName,isDraft,headRefOid > /tmp/claude/<sessionid>/pr_meta.json
-```
-
-Capture `number` and `headRefOid`.
-
-## Step 2 — Fetch failures
-
-Dispatch `cccl-fetch-ci-failures` with the PR number. The agent writes a TSV to a path you specify
-(`/tmp/claude/<sessionid>/failed_jobs.tsv`): `(job-id, name, grouping-hint)` per row.
-
-Zero failures → report and offer to wait. If waiting, schedule `ScheduleWakeup(delaySeconds=1200)`.
-
-## Step 3 — Group + pick representatives
-
-Bucket failures by shared axes (toolchain, library, variant, platform, phase). Pick one representative JID per
-group. Don't fetch every failure's logs.
-
-## Step 4 — Pull representative logs
-
-For each representative:
-
-```
-gh api repos/NVIDIA/cccl/actions/jobs/<JID>/logs > /tmp/claude/<sessionid>/job_<JID>.log
-```
-
-Works mid-run, unlike `gh run view --log-failed`.
-
-## Step 5 — Summarize via `cccl-summarize-job-log`
-
-Dispatch one agent per log, in parallel. Each returns 5–10 lines.
-
-## Step 6 — Present + ask
-
-Compact table:
-
-```
-Group                              | Repr JID    | Likely cause             | Affected
----------------------------------- | ----------- | ------------------------ | --------
-CTK13.2 GCC15 C++20 TestNoLaunch   | 74849038365 | infra: artifact download | 1
-CTK12.0 GCC8 C++17 CUB Build       | 7484903xxxx | -Wunused-but-set-param   | 8
-```
-
-Route through `cccl-clarify` to ask which groups to dig into.
-
-## Step 7 — Diagnose accepted groups
-
-Re-read representative logs; cross-reference repo code where the error names a file or function. Present:
-
-1. **What broke** — concrete error.
-2. **Why** — root-cause hypothesis.
-3. **Suggested fix** — concrete change, "rerun — transient infra", or "needs upstream report".
-4. **Confidence** — high/medium/low + one-line reason.
-
-For infra-only failures, suggest `gh run rerun <RUN_ID> --failed`.
-
-## Step 8 — Ship the fix
-
-1. **Worktree safety.** Refuse on `main`.
-2. **Apply edits.** Per-file approval via `cccl-clarify`.
-3. **Override matrix + skip tags.** Dispatch `cccl-ci-overrides` with `failed_jobs:` (TSV path) + `paths:` (edited
-   files). Offer the YAML and tag set via `cccl-clarify`. Skip tags apply to the LAST commit only.
-4. **Commit.** Route to `cccl-commit`.
-5. **Push + `/ok to test`.** Route to `cccl-pr` Phase 4.
-
-## Pitfalls
-
-- `gh pr checks` exits 1 when any check failed — expected.
-- Avoid `gh pr view --json statusCheckRollup` — 100k+ tokens for 500-job PRs.
-- Avoid `gh run view --log-failed` mid-run; use `gh api .../jobs/<JID>/logs` instead.
-- Don't fetch every failure's logs — one representative per cluster.
diff --git a/.agent/skills/cccl-triage/SKILL.md b/.agent/skills/cccl-triage/SKILL.md
new file mode 100644
index 00000000000..ca65f42f10f
--- /dev/null
+++ b/.agent/skills/cccl-triage/SKILL.md
@@ -0,0 +1,107 @@
+---
+description: |
+  Diagnose and fix CI failures — PR mode or nightly mode. Resolves the run, clusters
+  failures, fetches logs, summarizes, and on approval applies fixes plus override matrix
+  + skip tags. Triggers: "diagnose the PR", "fix CI on this PR", "what's failing in CI",
+  "triage the nightly", "fix nightly CI".
+argument-hint: "[PR-number | run-id]"
+---
+
+# cccl-triage
+
+Handles both PR-CI and nightly-CI failure triage. The two modes share the same workflow shape
+from failure-fetch through diagnosis; they diverge only in how the run is identified and how
+the fix is shipped. Route user-question moments through `cccl-clarify`. Create the scratch dir
+once: `mkdir -p /tmp/claude/<sessionid>/triage`.
+
+## Step 1 — Identify mode and resolve the run
+
+**PR mode** — triggered by PR context or a user-supplied PR number. See `references/pr.md §Resolve PR`.
+
+**Nightly mode** — triggered by nightly/scheduled CI language or absence of a PR context. See
+`references/nightly.md §Locate the run`.
+
+Capture the run's numeric ID and the commit SHA that triggered it. Both are required for Steps 2–4.
+
+## Step 2 — Fetch failures
+
+Dispatch `cccl-ci-fetch-failures` with the resolved PR number (PR mode) or run ID (nightly mode).
+The agent writes a TSV to `/tmp/claude/<sessionid>/triage/failed_jobs.tsv`: `(job-id, name, grouping-hint)` per row.
+
+Zero failures → report and offer to wait. If waiting, `ScheduleWakeup(delaySeconds=1200)`.
+
+## Step 3 — Cluster + pick representatives
+
+Bucket failures by shared axes (toolchain, library, variant, platform, phase). Pick one
+representative job ID per cluster. Full clustering guidance in `references/common.md §Clustering`.
+
+## Step 4 — Pull representative logs
+
+For each representative job ID:
+
+```
+gh api repos/NVIDIA/cccl/actions/jobs/<JID>/logs > /tmp/claude/<sessionid>/triage/job_<JID>.log
+```
+
+Works mid-run; prefer over `gh run view --log-failed`.
+
+## Step 5 — Summarize
+
+Dispatch one `cccl-ci-summarize-job-log` agent per log, in parallel (haiku tier). Each returns
+5–10 lines in the shape defined in `references/common.md §Log summary format`.
+
+## Step 6 — Present findings
+
+Compact table, one row per cluster:
+
+```
+Group                             | Repr JID     | Likely cause             | Affected
+--------------------------------- | ------------ | ------------------------ | --------
+CTK13.2 GCC15 C++20 TestNoLaunch  | 74849038365  | infra: artifact download | 1
+CTK12.0 GCC8 C++17 CUB Build      | 74849038xxx  | -Wunused-but-set-param   | 8
+```
+
+Route through `cccl-clarify` to ask which clusters to dig into further.
+
+## Step 7 — Diagnose accepted clusters
+
+Re-read representative logs; cross-reference repo code where the error names a file or function.
+Per cluster, present:
+
+1. **What broke** — concrete error.
+2. **Why** — root-cause hypothesis.
+3. **Suggested fix** — concrete change, "rerun — transient infra", or "needs upstream report".
+4. **Confidence** — high / medium / low + one-line reason.
+
+For infra-only failures, suggest `gh run rerun <RUN_ID> --failed`.
+
+## Step 8 — Ship the fix
+
+Mode-specific. See `references/pr.md §Ship` or `references/nightly.md §Ship`.
+
+Common to both modes:
+
+- Per-file edits require approval via `cccl-clarify`.
+- Dispatch `cccl-ci-overrides` with `failed_jobs:` (TSV path) and `paths:` (edited files).
+  Offer the YAML and tag set via `cccl-clarify`. Skip tags apply to the LAST commit only.
+- Commit via `cccl-commit`.
+
+## Good-enough criterion
+
+Findings presented to the user with cluster table, per-cluster diagnosis, and a concrete
+recommended action. Fix is shipped only on explicit user approval.
+
+## Hard prohibitions
+
+- Never commit or push without explicit per-step user approval.
+- Never run on `main` — refuse and offer a branch.
+- Never fetch every failure's logs — one representative per cluster.
+- Never use `gh pr view --json statusCheckRollup` — 100k+ tokens on large PRs.
+- Never use `gh run view --log-failed` mid-run; use the jobs API endpoint.
+- Never apply skip tags to any commit other than the last of a series.
+
+## Additional resources
+
+- `references/common.md` — clustering axes, log summary format, override-matrix synthesis
+- `references/pr.md` — PR-mode run resolution, path scoping, push + `/ok to test`
+- `references/nightly.md` — nightly run location, fresh branch/PR creation, for_workflow flag
diff --git a/.agent/skills/cccl-triage/references/common.md b/.agent/skills/cccl-triage/references/common.md
new file mode 100644
index 00000000000..978254f35e4
--- /dev/null
+++ b/.agent/skills/cccl-triage/references/common.md
@@ -0,0 +1,56 @@
+# cccl-triage — common reference
+
+## Clustering
+
+Bucket failures by shared axes before picking representatives. Apply in order:
+
+| Axis      | Examples                                              |
+|-----------|-------------------------------------------------------|
+| Phase     | configure / build / test / lint / upload              |
+| Library   | cub / thrust / libcudacxx / cudax / c                 |
+| Toolchain | CTK version + compiler + C++ standard                 |
+| Variant   | release / debug / ASAN / sanitizer                    |
+| Platform  | linux-amd64 / windows / arm                           |
+
+Pick the representative with the most informative log (build failures > test failures > timeouts).
+Infra failures (artifact download, runner setup, network) cluster separately regardless of toolchain.
+
+One representative job ID per cluster. Do not fetch every failure's log.
+
+## Log summary format
+
+Each `cccl-ci-summarize-job-log` dispatch returns 5–10 lines in this shape:
+
+```
+STATUS: <PASS|FAIL|INFRA|TIMEOUT|UNKNOWN>
+Phase: <configure|build|test|lint|upload>
+Error: <verbatim first error line, truncated at 120 chars>
+Context: <one sentence — what it was building or testing>
+Root cause: <hypothesis — one sentence>
+Confidence: <high|medium|low>
+Affects: <list of other clusters this root cause likely explains, or "only this cluster">
+```
+
+Collect all summaries before presenting the Step 6 table. Group identical root-cause hypotheses.
+
+## Override-matrix synthesis
+
+Dispatch `cccl-ci-overrides` with:
+
+- `failed_jobs:` — path to `failed_jobs.tsv`
+- `paths:` — list of files touched by the fix
+- `for_workflow:` — `pr` (default) or `nightly`
+
+The agent returns a YAML block suitable for `ci/matrix.yaml` `workflows.override` and a set
+of `[skip-*]` commit tags. Present both to the user via `cccl-clarify` before applying.
+
+Skip tags apply to the **last** commit of the series only. After CI passes, reset
+`workflows.override` to empty and drop the skip tags before merging.
+
+## Pitfalls
+
+- `gh pr checks` exits 1 when any check failed — expected behavior, not an error.
+- `gh pr view --json statusCheckRollup` — returns 100k+ tokens on 500-job PRs; avoid.
+- `gh run view --log-failed` — unavailable mid-run; use `gh api .../jobs/<JID>/logs` instead.
+- A single root cause often explains many clusters (e.g., a new compiler warning). Confirm
+  before generating per-cluster fixes.
diff --git a/.agent/skills/cccl-triage/references/nightly.md b/.agent/skills/cccl-triage/references/nightly.md
new file mode 100644
index 00000000000..1787a4fb955
--- /dev/null
+++ b/.agent/skills/cccl-triage/references/nightly.md
@@ -0,0 +1,47 @@
+# cccl-triage — nightly mode reference
+
+## Locate the run
+
+User-supplied run ID wins. Otherwise:
+
+```
+gh run list --workflow=ci-workflow-nightly.yml --branch=main --limit=1 --json databaseId,conclusion,createdAt,headSha > /tmp/claude/<sessionid>/triage/nightly_run.json
+```
+
+Capture `databaseId` (the run ID for `cccl-ci-fetch-failures`) and `headSha`.
+`conclusion: success` → stop and report; nothing to triage.
+
+## Nightly matrix scope
+
+The nightly matrix is broader than the PR matrix — more toolchains, more platforms, more
+optional variants. Expect more clusters. When a root cause spans many clusters, confirm
+a single fix covers all before generating the override YAML.
+
+Pass `for_workflow: nightly` to `cccl-ci-overrides` so it generates nightly-scoped matrix
+overrides. Include a comment in the override YAML referencing the nightly run ID:
+
+```yaml
+# Nightly run <databaseId> — <createdAt> — <root-cause summary>
+```
+
+## Ship
+
+No existing branch — a new one is required.
+
+1. Route through `cccl-clarify` to name the fix branch (suggest `ci/fix-nightly-<date>`).
+2. Offer to create the branch via `cccl-clarify` before applying edits.
+3. Apply override YAML to `ci/matrix.yaml` `workflows.override`.
+4. Add skip tags to the last commit message (via `cccl-commit`).
+5. Route to `cccl-commit` to stage and commit.
+6. Route to `cccl-pr` Phase 1 to open a draft PR. PR body must reference:
+   - The nightly run ID and timestamp.
+   - Per-cluster diagnosis (summary table from Step 6).
+   - Any deferred problems (offer `gh issue create` for issues not addressed by the fix).
+7. Multiple independent fix branches → run `cccl-pr` Phase 1 once per branch, framed via
+   `cccl-clarify`.
+
+## Deferred problems
+
+Failures that need upstream action or are non-trivial to fix in a single PR should be offered
+as GitHub issues (`gh issue create`) rather than left as unaddressed clusters in the fix PR.
+Route through `cccl-clarify` to confirm before creating issues.
diff --git a/.agent/skills/cccl-triage/references/pr.md b/.agent/skills/cccl-triage/references/pr.md
new file mode 100644
index 00000000000..7987920bb2e
--- /dev/null
+++ b/.agent/skills/cccl-triage/references/pr.md
@@ -0,0 +1,35 @@
+# cccl-triage — PR mode reference
+
+## Resolve PR
+
+User-supplied PR number wins. Otherwise:
+
+```
+gh pr view --json number,title,state,url,headRefName,isDraft,headRefOid > /tmp/claude/<sessionid>/triage/pr_meta.json
+```
+
+Capture `number` (for `cccl-ci-fetch-failures`) and `headRefOid` (the commit SHA for
+`/ok to test`). If no open PR exists on the current branch, route through `cccl-clarify`
+to ask for a PR number.
+
+## Path scoping
+
+When the user asks about failures in specific libraries or paths, filter `failed_jobs.tsv`
+before clustering by intersecting the `name` column against the relevant path prefixes
+(e.g., `cub/`, `thrust/`, `libcudacxx/`). This is optional — full-matrix triage skips it.
+
+## Ship
+
+After edits are approved and `cccl-ci-overrides` output is accepted:
+
+1. Apply override YAML to `ci/matrix.yaml` `workflows.override`.
+2. Add skip tags to the last commit message (via `cccl-commit`).
+3. Route to `cccl-commit` to stage and commit.
+4. Route to `cccl-pr` Phase 4 to push and post `/ok to test <headRefOid>`.
+
+The PR already exists; no PR creation step.
+
+## Worktree safety
+
+Refuse on `main`. The PR must be on a feature branch. If the working tree is on `main`,
+route through `cccl-clarify` — the user likely needs to check out the PR branch first.
diff --git a/.agent/skills/cccl/SKILL.md b/.agent/skills/cccl/SKILL.md
index a2a229c7c07..c6059f2c189 100644
--- a/.agent/skills/cccl/SKILL.md
+++ b/.agent/skills/cccl/SKILL.md
@@ -1,39 +1,43 @@
 ---
-name: cccl
-description: "Entry-point orientation for the CCCL repository. Surfaces the available CCCL-specific skills and agents and points at common entry phrases. Load this skill first in every CCCL session before doing other work. Use when starting any task in this repo, when unsure which CCCL skill to use, or when introduced to the repo cold."
+description: "Entry-point router for the cccl-* skill and agent family. Load first in any CCCL session. Routes by intent: commit, branch, PR, CI, build/test, libraries, infrastructure, benchmarks, docs. Triggers: \"where do I start\", \"which skill should I use\", \"new to this repo\", \"what cccl skill handles X\"."
 ---
 
 # cccl
 
-Skills live in `.agent/skills/`; agents live in `.agent/agents/`. `.claude/skills` and `.claude/agents` symlink to
-those so Claude Code and Codex find the same files.
+Entry-point router for the cccl-* skills and agents.
 
-If you don't know how skill or agent invocation works, load `cccl-agent-impl` first.
+Skills live under `.agent/skills/<name>/SKILL.md`; agents under `.agent/agents/<name>.md`. `.claude/skills` and `.claude/agents` are directory symlinks to those.
+
+Entry skills (`cccl-*`) are slash-completable workflow entry points. Detail skills (`cccl_detail-*`) auto-load via description match and are excluded from slash autocomplete. See `references/skills-and-agents.md` for the full catalog and invocation mechanics.
 
 ## Where to start by intent
 
-| Intent                                         | Load                                                |
-|------------------------------------------------|-----------------------------------------------------|
-| Commit uncommitted changes / wrap up a fix     | `cccl-commit`                                       |
-| Resplit / clean up a branch's commit history   | `cccl-resplit-branch`                               |
-| Open / edit / comment on a PR / trigger CI     | `cccl-pr`                                           |
-| Diagnose CI on this PR / what's failing        | `cccl-triage-pr`                                    |
-| Triage nightly / fix nightly CI                | `cccl-triage-nightly`                               |
-| Stuck on a decision / should I X or Y          | `cccl-clarify`                                      |
-| Post `/ok to test`                             | `cccl-ok-to-test` agent (called by a skill)         |
-| Generate override matrix / skip tags           | `cccl-ci-overrides` agent (called by a skill)       |
-| How does CI work / where is X CI defined       | `cccl-ci`                                           |
-| Set up a benchmark on this PR                  | `cccl-ci-benchmarks`                                |
-| Git bisect a regression                        | `cccl-bisect`                                       |
-| Build / test in the devcontainer               | `cccl-devcontainers`, `cccl-build-and-test-targets` |
-| Build cub / thrust / libcudacxx / cudax (full) | `cccl-cpp-builds`                                   |
-| Work on / build / test the python bindings     | `cccl-python`                                       |
-| Check for SASS/PTX changes                     | `cccl-sass-diff`                                    |
-| libcudacxx code style                          | `cccl-libcudacxx-style`                             |
-
-## Repo conventions
-
-- **Scratch space**: `/tmp/claude/<sessionid>/`. Create with `mkdir -p`. Don't pipe; redirect to a file and Read.
-- **CI** uses `ci/matrix.yaml` with optional `workflows.override` to scope PR jobs; `[skip-*]` commit tags scope
-  further. Both block merging while present. See `cccl-ci`.
-- **`/ok to test <SHA>`** is required from a maintainer for external PRs. The `cccl-ok-to-test` agent posts it.
+| Intent                               | Load                     |
+|--------------------------------------|--------------------------|
+| Commit / wrap up a fix               | `cccl-commit`            |
+| Resplit / clean up commit history    | `cccl-resplit-branch`    |
+| Open / edit / comment on a PR        | `cccl-pr`                |
+| Diagnose CI failures (PR or nightly) | `cccl-triage`            |
+| Stuck on a decision                  | `cccl-clarify`           |
+| CI overview / matrix / skip tags     | `cccl-ci`                |
+| Benchmarks / perf comparisons        | `cccl-bench`             |
+| Git bisect a regression              | `cccl-bisect`            |
+| Build a target                       | `cccl-build`             |
+| Run tests                            | `cccl-test`              |
+| Launch a devcontainer                | `cccl-devcontainer`      |
+| Cross-functional infra               | `cccl-infra`             |
+| Python bindings                      | `cccl-python`            |
+| SASS / PTX comparison                | `cccl-sass-diff`         |
+| Pre-commit / linters / formatters    | `cccl-precommit`         |
+| CMake presets / configuration        | `cccl-cmake`             |
+| Sphinx docs / Doxygen                | `cccl-docs`              |
+| libcudacxx                           | `cccl-libcudacxx`        |
+| CUB                                  | `cccl-cub`               |
+| Thrust                               | `cccl-thrust`            |
+| cudax                                | `cccl-cudax`             |
+| C Parallel Library                   | `cccl-c`                 |
+
+## Additional resources
+
+- `references/skills-and-agents.md` — full entry-skill catalog, detail-skill catalog, agent catalog, naming convention, invocation mechanics.
+- `references/docs.md` — index of top-level CCCL orientation documentation.
diff --git a/.agent/skills/cccl/references/docs.md b/.agent/skills/cccl/references/docs.md
new file mode 100644
index 00000000000..d1a9c82ad1e
--- /dev/null
+++ b/.agent/skills/cccl/references/docs.md
@@ -0,0 +1,27 @@
+# Documentation index — cccl
+
+Top-level orientation documents. Relevant to any CCCL session as background context.
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `README.md` | Welcome guide: mission, the three core libraries (CUB, Thrust, libcudacxx), quick links. |
+| `CONTRIBUTING.md` | Getting started: fork, branch conventions, devcontainer setup, pre-commit, first PR. |
+| `AGENTS.md` | Agent-specific instructions for building, testing, and contributing to CCCL. |
+| `docs/index.rst` | Main Sphinx landing page; links to C++, Python, and maintainer doc sections. |
+| `docs/cpp.rst` | C++ libraries landing page (CUB, Thrust, libcudacxx, cudax). |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cccl/config_macros.rst` | CMake and compile-time configuration options across CCCL libraries. |
+| `docs/cccl/3.0_migration_guide.rst` | Breaking changes and upgrade path from CCCL 2.x to 3.x. |
+| `docs/cccl/tma.rst` | Tensor Memory Accelerator (TMA) hardware feature: API and usage in CCCL. |
+| `ci-overview.md` | CI environment, matrix.yaml structure, skip tags, override matrix, troubleshooting. |
+
+## See also
+
+- `cccl-build` `references/docs.md`, `cccl-test` `references/docs.md` — build/test documentation.
+- Per-library skills for library-specific docs (cccl-cub, cccl-thrust, cccl-libcudacxx, cccl-cudax).
diff --git a/.agent/skills/cccl/references/skills-and-agents.md b/.agent/skills/cccl/references/skills-and-agents.md
new file mode 100644
index 00000000000..cdf1d6df9c7
--- /dev/null
+++ b/.agent/skills/cccl/references/skills-and-agents.md
@@ -0,0 +1,92 @@
+# Skills and agents — catalog and invocation mechanics
+
+## Skill invocation mechanics
+
+Skills are loaded by the harness via description match: on every session turn, every word in `description:`
+is indexed for intent matching. When a user phrase triggers a skill, the harness loads the SKILL.md body for
+that session. References (`references/*.md`) are **not** loaded automatically; the skill body loads them on
+demand by instructing the orchestrator to read a specific path.
+
+Key behaviors:
+
+- `description:` loads every turn — keep it 30–60 words; long descriptions are an anti-pattern.
+- Skill body loads every trigger — keep to the essential workflow spine; push edge cases and templates to `references/`.
+- Invoke via the **Skill tool** with `skill: <name>`. Skills are not reentrant.
+- Entry skills (`cccl-*`) appear in slash autocomplete (`/cccl-` prefix). Detail skills (`cccl_detail-*`) do not.
+
+## Agent invocation mechanics
+
+Agents live at `.agent/agents/<name>.md`. Dispatch via the **Agent tool** with `subagent_type: <name>` and an
+explicit `model:` parameter; the per-call value overrides the agent's frontmatter. Model tiers:
+
+- `haiku` — mechanical tasks: log parsing, SHA verification, JSON extraction.
+- `sonnet` — multi-file reasoning or judgment (e.g. generating override matrices).
+
+CCCL agents are leaf agents: non-interactive (no `AskUserQuestion`), no spawning subagents. User dialogue
+belongs in the calling skill.
+
+## Entry-skill catalog
+
+| Skill                 | Purpose                                                                 |
+|-----------------------|-------------------------------------------------------------------------|
+| `cccl`                | Entry-point router; load first in any session                           |
+| `cccl-infra`          | Cross-functional infrastructure: CTK bumps, release cycles, CI/cmake/precommit/devcontainer/github/examples workflows |
+| `cccl-bisect`         | Git bisect a regression via cloud workflow or local devcontainer        |
+| `cccl-build`          | Configure and build targets; single-target and subproject builds        |
+| `cccl-test`           | Run tests: ctest, lit, targeted regex; smallest-scope-first             |
+| `cccl-ci`             | CI overview: matrix, skip tags, agents, override matrix, bench stub     |
+| `cccl-clarify`        | Decision-point escalation for ambiguous choices or tricky tradeoffs     |
+| `cccl-commit`         | Interactive commit prep: survey, split, chunks, message, commit         |
+| `cccl-devcontainer`   | Launch devcontainers; Docker workflow; container selection               |
+| `cccl-pr`             | Open, edit, comment on PRs; push + trigger CI                           |
+| `cccl-python`         | Python bindings (cuda-cccl): build, test, publish                       |
+| `cccl-resplit-branch` | Rebase and resplit a branch's commit history into a clean series        |
+| `cccl-sass-diff`      | SASS/PTX comparison between two refs                                    |
+| `cccl-triage`         | Diagnose CI failures on a PR or the latest nightly; optionally fix      |
+| `cccl-docs`           | Sphinx docs, Doxygen, gen_docs.bash, docs-deploy.yml                    |
+| `cccl-cmake`          | CMake presets, configuration options, usage entry point                 |
+| `cccl-precommit`      | Pre-commit hooks, clang-format, ruff, gersemi, codespell, shellcheck    |
+| `cccl-bench`          | Write benchmarks, nvbench, cccl.bench Python harness, CI bench requests |
+| `cccl-libcudacxx`     | libcudacxx tour: stdlib, style, headers, internal macros                |
+| `cccl-cub`            | CUB tour: block primitives, algorithms, device API, tests               |
+| `cccl-thrust`         | Thrust tour: algorithms, policies, execution backends, tests            |
+| `cccl-cudax`          | cudax tour: experimental features, containers, streams                  |
+| `cccl-c`              | C Parallel Library: C bindings for CCCL algorithms                      |
+
+## Detail-skill catalog
+
+Detail skills auto-load via description match. They do **not** appear in slash autocomplete.
+
+| Skill                             | Purpose                                                                          |
+|-----------------------------------|----------------------------------------------------------------------------------|
+| `cccl_detail-cmake`               | Buildsystem internals, arch variants, CMake helpers index                        |
+| `cccl_detail-ci`                  | CI internals deep dive: workflow-build action, skip-tag plumbing, downstream     |
+| `cccl_detail-release`             | Versioning pipeline: `cccl-version.json`, `update_version.sh`, release workflows, Python wheel versioning |
+| `cccl_detail-github`              | Templates, CODEOWNERS, copy-pr-bot, CodeRabbit, non-CI workflows                 |
+| `cccl_detail-examples`            | Top-level `examples/` CPM-consumption tests via `cccl_add_compile_test`          |
+| `cccl_detail-test-params`         | `%PARAM%` convention via `cmake/CCCLTestParams.cmake`                            |
+| `cccl_detail-cpp-macros`          | `_CCCL_*` internal macros: compiler detection, visibility, ABI, diagnostics      |
+| `cccl_detail-devcontainer-matrix` | `make_devcontainers.sh`, 60+ container configs from `ci/matrix.yaml`, `verify-devcontainers.yml` |
+
+## Agent catalog
+
+| Agent                       | Model    | Purpose                                                            |
+|-----------------------------|----------|--------------------------------------------------------------------|
+| `cccl-ci-overrides`         | `sonnet` | Generate CI override matrix and skip-tag recommendations           |
+| `cccl-ci-fetch-failures`    | `haiku`  | Fetch and parse failed CI job logs; return structured failure list |
+| `cccl-ci-summarize-job-log` | `haiku`  | Summarize a single CI job log; return structured digest            |
+
+## Naming convention
+
+Format: `cccl[_detail]-<subarea>[-<topic>]*`. Entry skills use the `cccl-` prefix (slash-completable). Detail and
+helper skills use `cccl_detail-` — the underscore-detail marker sits between `cccl` and the first kebab dash,
+keeping them out of `/cccl-` autocomplete. Multiple topic suffixes allowed for hierarchical nesting (e.g.
+`cccl_detail-cmake-helpers`). When naming, drop tokens that duplicate the parent directory path; do not
+abbreviate semantic content.
+
+## Where these files live
+
+- Skills: `.agent/skills/<name>/SKILL.md`
+- Skill references: `.agent/skills/<name>/references/*.md` (on-demand only; not auto-loaded)
+- Agents: `.agent/agents/<name>.md`
+- `.claude/skills` and `.claude/agents` are directory symlinks to `.agent/skills` and `.agent/agents` respectively
diff --git a/.agent/skills/cccl_detail-ci/SKILL.md b/.agent/skills/cccl_detail-ci/SKILL.md
new file mode 100644
index 00000000000..e26bb482c0f
--- /dev/null
+++ b/.agent/skills/cccl_detail-ci/SKILL.md
@@ -0,0 +1,139 @@
+---
+description: |
+  CCCL CI plumbing internals — matrix expansion, inspect_changes, skip-tag and
+  override mechanics, copy-pr-bot, result aggregation, workflow-build action,
+  devcontainer launch from CI. Auto-loads for deep questions about how the CI
+  pipeline works, not just what it produces.
+  Triggers: "how does CI work internally", "matrix expansion", "inspect_changes details",
+  "build-workflow.py", "skip-tag mechanics", "copy-pr-bot internals".
+---
+
+Deep-internals reference for CCCL's CI pipeline. Covers the implementation
+layer beneath the user-facing overview in `ci-overview.md`.
+
+## Matrix expansion pipeline
+
+Every CI run (PR, nightly, weekly) starts with a `build-workflow` job on
+`ubuntu-latest`. That job calls the `.github/actions/workflow-build` composite
+action, which:
+
+1. Optionally runs `ci/inspect_changes.py` to classify which projects are dirty
+   (PR mode only).
+2. Calls `build-workflow.py ci/matrix.yaml --workflows <name>` with
+   `--full-build-projects` and `--lite-build-projects` populated from step 1.
+3. Calls `prepare-workflow-dispatch.py` to shape the output into dispatch-ready
+   JSON.
+4. Exports the result as the `workflow` output and uploads the `workflow/`
+   artifact.
+
+The parent workflow fans out into four dispatch-group jobs
+(`dispatch-groups-linux-two-stage`, `-windows-two-stage`,
+`-linux-standalone`, `-windows-standalone`), each receiving a slice of the
+expanded matrix.
+
+See `references/matrix-expansion.md` for `build-workflow.py` internals: tag
+explosion, std aliases, two-stage producer/consumer grouping, GUID assignment,
+and the override path.
+
+## inspect_changes.py and project scoping
+
+On PRs, `inspect_changes.py` diffs `base_sha..HEAD` using
+`ci/project_files_and_dependencies.yaml` to classify every changed file. Files
+not matched by any project fall into the `core` bucket; any `core` hit triggers
+a full rebuild of all projects.
+
+For non-core files, the script propagates rebuild requirements through the
+dependency graph:
+
+- `full_dependencies` — dirty dep → dependent gets `FULL_BUILD` (full workflow)
+- `lite_dependencies` — dirty dep → dependent gets `LITE_BUILD` (`_lite`
+  workflow variant if one exists, otherwise falls back to full)
+- Transitive lite dependencies are computed at config-load time
+
+Outputs are space-separated lists emitted as `FULL_BUILD` and `LITE_BUILD`
+GitHub Actions step outputs, then passed to `build-workflow.py` via
+`--full-build-projects` / `--lite-build-projects`.
+
+See `references/inspect-changes.md` for the full dependency graph, `core`
+semantics, and `ignore_regexes` list.
+
+## Skip-tag and override mechanics
+
+Skip tags are read in the `build-workflow` job's **Export workflow flags** step,
+directly from `github.event.head_commit.message`. Each tag sets a boolean output
+(`matrix_enabled`, `vdc_enabled`, `docs_enabled`, etc.) that gates downstream
+jobs via `if:` conditions.
+
+`[bench-only]` is a composite alias: it sets `matrix_enabled=false`,
+`vdc_enabled=false`, `docs_enabled=false`, and enables the bench path.
+
+The override matrix is processed inside `build-workflow.py`: if
+`workflows.override` in `ci/matrix.yaml` is non-empty and `--allow-override` is
+passed (PR mode only), the override list replaces the requested workflow
+entirely. The `workflow-results` action fails the workflow if `override.json`
+exists, ensuring overrides block merging.
+
+Tags and overrides can be combined. Skip tags apply at the job-dispatch layer;
+the override matrix applies at the matrix-expansion layer.
+
+## copy-pr-bot and `/ok to test`
+
+CCCL uses NVIDIA's `copy-pr-bot` GitHub App. On external PRs, the bot does
+nothing until a repository member comments `/ok to test <SHA>`. The bot
+verifies the SHA matches the PR head, then copies the branch to a
+`pull-request/<N>` prefixed branch. The `pull_request` workflow triggers on
+pushes to `pull-request/[0-9]+`, not on the originating PR branch.
+
+Internal contributors with SSH-signed commits skip the bot; signed pushes to
+the `pull-request/<N>` branch trigger CI automatically.
+
+`additional_trustees` in `.github/copy-pr-bot.yaml` grants `/ok to test`
+permission to listed contributors beyond the default set.
+
+See `references/copy-pr-bot.md` for the full flow and SSH signing setup.
+
+## Per-job runner setup (workflow-run-job-linux)
+
+Each dispatched job runs inside `.github/actions/workflow-run-job-linux`
+(Windows: `-windows`). The action:
+
+1. Sets `JOB_CUDA`, `JOB_HOST`, `JOB_IMAGE`, `JOB_RUNNER`, `JOB_ENVIRONMENT`
+   as env vars.
+2. Fetches AWS credentials for the sccache bucket via OIDC.
+3. Launches `.devcontainer/launch.sh --docker --cuda $JOB_CUDA --host
+   $JOB_HOST` with `--gpus device=...` when on a GPU runner.
+4. Inside the container, `eval "${COMMAND}"` runs the CI script (e.g.
+   `"./ci/build_cub.sh" -std "17" -arch "80-real"`).
+5. On failure, prints a reproducer block with the exact `launch.sh` invocation.
+6. Calls `ci/upload_job_result_artifacts.sh` unconditionally to record
+   pass/fail for result aggregation.
+
+## Result aggregation (workflow-results)
+
+After all dispatch groups finish, `verify-workflow` calls
+`.github/actions/workflow-results`. It:
+
+1. Downloads the `workflow/` artifact (job manifest from `build-workflow`).
+2. Downloads all `zz_jobs-*` artifacts (success/fail records from each job).
+3. Runs `verify-job-success.py workflow/job_ids.json` — checks that every job
+   ID in the manifest has a corresponding success artifact.
+4. Runs `prepare-execution-summary.py` and `final-summary.py` to build the PR
+   comment table.
+5. Posts a sticky PR comment via `marocchino/sticky-pull-request-comment`.
+6. Fails the step if `workflow/override.json` exists (blocks merge when
+   override is active).
+
+Nightly/weekly runs additionally post Slack notifications on start and
+finish/failure.
+
+## Additional resources
+
+- `references/docs.md` — index of CCCL CI documentation.
+- `references/tools.md` — CI plumbing scripts with ownership and cross-references.
+- `references/inspect_changes_usage.md` — `ci/inspect_changes.py` CLI interface and examples.
+- `references/matrix-expansion.md` — `build-workflow.py` internals: tag
+  explosion, two-stage grouping, GUID assignment, override path.
+- `references/inspect-changes.md` — dependency graph, `core` semantics,
+  `ignore_regexes`, full/lite classification rules.
+- `references/copy-pr-bot.md` — copy-pr-bot flow, `/ok to test` mechanics,
+  SSH signing.
diff --git a/.agent/skills/cccl_detail-ci/references/copy-pr-bot.md b/.agent/skills/cccl_detail-ci/references/copy-pr-bot.md
new file mode 100644
index 00000000000..648d7d84139
--- /dev/null
+++ b/.agent/skills/cccl_detail-ci/references/copy-pr-bot.md
@@ -0,0 +1,83 @@
+# copy-pr-bot and /ok to test
+
+## Why it exists
+
+CCCL uses NVIDIA self-hosted GPU runners. For security, PR code from external
+contributors must not run on those runners until reviewed. `copy-pr-bot`
+enforces this gate by copying approved code to a separate branch that the
+`pull_request` workflow actually watches.
+
+## Trigger: the pull_request workflow
+
+```yaml
+on:
+  push:
+    branches:
+      - "pull-request/[0-9]+"
+```
+
+The workflow does not trigger on `pull_request` events. It triggers on pushes
+to branches named `pull-request/<N>`. The bot creates and updates these
+branches; direct pushes are also accepted for internal contributors.
+
+## External contributor flow
+
+1. Contributor opens a PR from a fork or feature branch.
+2. Reviewer inspects the changes.
+3. Reviewer comments `/ok to test <SHA>` where `<SHA>` is the PR head SHA.
+4. copy-pr-bot verifies the SHA matches `github.event.pull_request.head.sha`.
+5. Bot pushes the code to `pull-request/<PR number>`.
+6. The `pull_request` workflow triggers.
+
+The SHA verification prevents reviewers from accidentally approving a commit
+that was pushed after their review.
+
+## Internal contributor flow
+
+Internal contributors with SSH-signed commits do not need `/ok to test`. A
+signed push to the PR branch (which is typically already a `pull-request/<N>`
+branch for internal contributors) triggers CI automatically.
+
+SSH signing setup:
+
+```bash
+git config --global gpg.format ssh
+git config --global user.signingKey ~/.ssh/<KEY>.pub
+git config --global commit.gpgsign true
+```
+
+Upload the key as a **Signing Key** (not just authentication key) at
+`github.com/settings/keys`.
+
+## additional_trustees
+
+`.github/copy-pr-bot.yaml`:
+
+```yaml
+additional_trustees:
+  - ahendriksen
+  - gonzalobg
+```
+
+Users listed here may use `/ok to test` in addition to the default set
+(NVIDIA org members with write access). The `auto_sync_draft` field controls
+whether draft PRs are auto-synced; it is `false` in CCCL.
+
+## Concurrency and re-runs
+
+The `pull_request` workflow has:
+
+```yaml
+concurrency:
+  group: ${{ github.workflow }}-on-${{ github.event_name }}-from-${{ github.ref_name }}
+  cancel-in-progress: true
+```
+
+A new `/ok to test` (new push to the `pull-request/<N>` branch) cancels any
+in-progress run for that PR.
+
+## No automatic run for new commits
+
+Each new commit on an external PR requires a new `/ok to test <SHA>`. The SHA
+must match the new head exactly. This prevents a reviewer from approving old
+code that has since been superseded.
diff --git a/.agent/skills/cccl_detail-ci/references/docs.md b/.agent/skills/cccl_detail-ci/references/docs.md
new file mode 100644
index 00000000000..653bc06508b
--- /dev/null
+++ b/.agent/skills/cccl_detail-ci/references/docs.md
@@ -0,0 +1,17 @@
+# Documentation index — cccl_detail-ci
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `ci-overview.md` | CI environment, matrix.yaml structure, skip tags, override matrix, `/ok to test` policy, and troubleshooting commands. User-facing CI reference; this skill covers the implementation layer beneath it. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/cccl/contributing.rst` | Repository structure, build workflow, testing, and CI guidelines (Sphinx version of CONTRIBUTING.md). |
+
+## See also
+
+- `cccl-ci` `references/docs.md` for the user-facing CI skill's documentation pointers.
diff --git a/.agent/skills/cccl_detail-ci/references/inspect-changes.md b/.agent/skills/cccl_detail-ci/references/inspect-changes.md
new file mode 100644
index 00000000000..ad62c7820cc
--- /dev/null
+++ b/.agent/skills/cccl_detail-ci/references/inspect-changes.md
@@ -0,0 +1,104 @@
+# inspect_changes.py — project change classification
+
+## Source files
+
+- `ci/inspect_changes.py` — main script
+- `ci/project_files_and_dependencies.yaml` — project definitions and dep graph
+
+## Classification mechanics
+
+`inspect_changes.py --refs <base_sha> <head_sha>` runs `git diff --name-only`
+from the merge-base of the two refs (not from `<base_sha>` itself) to
+`<head_sha>`. Shallow repos are unshallowed first.
+
+Each path is tested against global `ignore_regexes` first. Ignored paths are
+tracked separately and do not affect project dirty state.
+
+Each surviving path is tested against every non-`core` project in order. It belongs to a project when:
+
+1. Path matches any `include_regexes` (anchored to repo root).
+2. Path does not match any `exclude_regexes`.
+3. Path is not in the file set of any `exclude_project_files` project.
+
+A path may match multiple projects.
+
+### core semantics
+
+`core` collects all paths not matched by any non-core project. Any file in
+`core` triggers `project_statuses = {all: "Dirty"}` — a full rebuild of every
+project with a `matrix_project`. This is the "catch-all" for infra changes.
+
+### Dependency propagation
+
+After initial dirty classification, the script propagates through the reverse
+dependency graph:
+
+```
+full_dependency edge  → propagate only at depth==0  → FULL_BUILD
+lite_dependency edge  → any depth                   → LITE_BUILD
+transitive lite dep   → computed at load time        → LITE_BUILD
+```
+
+A project in `FULL_BUILD` is tested with the full workflow variant.
+A project that appears only in `LITE_BUILD` is tested with the `_lite` workflow
+variant (if one exists) or with the full workflow as a fallback.
+
+Projects without a `matrix_project` field (e.g. `libcudacxx_public`,
+`cub_public`, `c2h`) are internal dependency nodes only; they never appear in
+`FULL_BUILD`/`LITE_BUILD` outputs. They exist to distinguish public-API files
+from internal files: a change to `cub/cub/` (public API) propagates
+differently than a change to `cub/test/` (internal).
+
+### Output format
+
+```
+FULL_BUILD=libcudacxx cub thrust
+LITE_BUILD=cudax
+```
+
+Written to `$GITHUB_OUTPUT` and also printed to stdout. The optional
+`--summary` flag writes a Markdown table to a file (used by `action.yml` to
+produce the `workflow/changes.md` PR comment section).
+
+## ignore_regexes list (selected entries)
+
+| Pattern                                       | Rationale                                     |
+|-----------------------------------------------|-----------------------------------------------|
+| `.+\.md$`                                     | Documentation-only changes never affect build |
+| `\.branch_notes/`                             | Local scratch notes, not repo content         |
+| `\.claude/`                                   | Agent scaffolding                             |
+| `docs/`                                       | Sphinx source, not headers or scripts         |
+| `ci/bench.*yaml`                              | Bench config, not build/test logic            |
+| `.github/workflows/bench.*\.yml`              | Bench workflows                               |
+| `.github/workflows/verify-devcontainers\.yml` | VDC workflow                                  |
+
+Full list in `ci/project_files_and_dependencies.yaml` under `ignore_regexes`.
+
+## Public/internal split pattern
+
+Projects with large dependency surfaces use a two-key split:
+
+```
+cub_public    (include: cub/cub/)          — no matrix_project, dep node only
+cub_internal  (include: cub/, exclude: cub_public)  — matrix_project: "cub"
+```
+
+A change to `cub/cub/` marks `cub_public` dirty and propagates to all
+dependents via `full_dependencies` or `lite_dependencies`. A change to
+`cub/test/` marks only `cub_internal` dirty — dependents of `cub_public` are
+unaffected.
+
+## Project dependency graph (current)
+
+| Project key                  | Depends on (full)          | Depends on (lite)                |
+|------------------------------|---------------------------|----------------------------------|
+| `libcudacxx_internal`        | `libcudacxx_public`       | `c2h`                            |
+| `cub_internal`               | `cub_public`              | `c2h`, `nvbench_helper`          |
+| `thrust_internal`            | `thrust_public`           | `nvbench_helper`                 |
+| `cudax_internal`             | `cudax_public`            | `c2h`, `nvbench_helper`          |
+| `cccl_c_parallel_internal`   | `cccl_c_parallel_public`  | `c2h`                            |
+| `cccl_c_parallel_hostjit`    | `cccl_c_parallel_public`  | `libcudacxx_public`              |
+| `python`                     | —                         | `cccl_c_parallel_public`         |
+| `tidy`                       | all public+internal keys  | —                                |
+| `libcudacxx_public`          | —                         | `thrust_public`, `cub_public`    |
+| `cub_public`                 | —                         | `libcudacxx_public`, `thrust_public` |
diff --git a/.agent/skills/cccl_detail-ci/references/inspect_changes_usage.md b/.agent/skills/cccl_detail-ci/references/inspect_changes_usage.md
new file mode 100644
index 00000000000..4cd28904260
--- /dev/null
+++ b/.agent/skills/cccl_detail-ci/references/inspect_changes_usage.md
@@ -0,0 +1,75 @@
+# `ci/inspect_changes.py` usage
+
+Classifies CCCL subprojects as dirty or clean given a set of changed files. Used in the CI
+`build-workflow` job to prune the test matrix to only the projects affected by a PR's changes.
+Also useful locally to predict which CI jobs a set of edits will trigger.
+
+## Location
+
+`ci/inspect_changes.py`. Run from the repo root (or anywhere — it resolves the repo root from its
+own path). Requires Python 3 and `pyyaml`. Config file: `ci/project_files_and_dependencies.yaml`.
+
+## Interface
+
+```
+usage: inspect_changes.py [-h] (--refs BASE HEAD | --file PATH | --stdin) [--summary PATH]
+
+Identify which CCCL projects require rebuilds between two commits.
+
+options:
+  -h, --help        show this help message and exit
+  --refs BASE HEAD  Compare two refs using 'git diff --name-only' to determine dirty files
+  --file PATH       Read dirty file paths (one per line) from PATH
+  --stdin           Read dirty file paths (one per line) from stdin
+  --summary PATH    Optional path to write a markdown summary table
+```
+
+## Options
+
+| Flag | Required? | Description |
+|------|-----------|-------------|
+| `--refs BASE HEAD` | Yes* | Two git refs; diffs from `merge-base(BASE, HEAD)` to `HEAD`. |
+| `--file PATH` | Yes* | Newline-separated file of changed paths; bypasses git. |
+| `--stdin` | Yes* | Reads changed paths from stdin; bypasses git. |
+| `--summary PATH` | No | Writes a Markdown table of project statuses to `PATH`. Used by CI to produce the `workflow/changes.md` section. |
+
+\* Exactly one of `--refs`, `--file`, or `--stdin` is required.
+
+## Output
+
+```
+FULL_BUILD=libcudacxx cub thrust
+LITE_BUILD=cudax
+```
+
+Printed to stdout and, when running inside GitHub Actions, written to `$GITHUB_OUTPUT` as step
+outputs. `FULL_BUILD` projects run the full workflow; `LITE_BUILD` projects run the `_lite`
+variant (or full as fallback). An empty value means no projects need rebuilding.
+
+## Examples
+
+```bash
+# Check which projects a branch's changes touch, using origin/main as the base ref
+python3 ci/inspect_changes.py --refs origin/main HEAD
+
+# Predict CI impact of a specific set of files
+echo "cub/cub/block/block_reduce.cuh" | python3 ci/inspect_changes.py --stdin
+
+# Write a markdown summary alongside stdout output
+python3 ci/inspect_changes.py --refs origin/main HEAD --summary /tmp/changes.md
+
+# Check a file list saved by git diff
+git diff --name-only origin/main HEAD > /tmp/changed.txt
+python3 ci/inspect_changes.py --file /tmp/changed.txt
+```
+
+## Notes / gotchas
+
+- Uses the **merge-base** of `BASE` and `HEAD`, not `BASE` itself, so that commits landed on
+  `BASE` after the branch point are not misreported as changes made by `HEAD`.
+- Shallow repos are automatically unshallowed before diffing.
+- Paths matching `ignore_regexes` in `project_files_and_dependencies.yaml` (e.g. `.md` files,
+  `docs/`, `.claude/`) are silently excluded — changes to those paths never trigger rebuilds.
+- Any file not matched by a named project lands in `core`, which triggers a full rebuild of every
+  project. Unknown paths are handled conservatively.
+- For dependency propagation details, see `references/inspect-changes.md`.
diff --git a/.agent/skills/cccl_detail-ci/references/matrix-expansion.md b/.agent/skills/cccl_detail-ci/references/matrix-expansion.md
new file mode 100644
index 00000000000..941f8821362
--- /dev/null
+++ b/.agent/skills/cccl_detail-ci/references/matrix-expansion.md
@@ -0,0 +1,82 @@
+# Matrix expansion internals
+
+## build-workflow.py pipeline
+
+`ci/matrix.yaml` is the single source of truth. `build-workflow.py` reads it
+and produces the dispatch JSON used by the parent workflow.
+
+### Tag processing order
+
+For each matrix entry:
+
+1. **Validate** — required tags checked; unknown tags error.
+2. **Set defaults** — missing optional tags filled from `matrix.yaml["tags"][tag]["default"]`.
+3. **Explode** — any tag whose value is a list is exploded into N separate
+   matrix jobs. `jobs` and `environment` are left unexploded (they are passed
+   as lists to optimize scheduling).
+4. **Canonicalize** — CTK aliases resolved (`"latest"` → `"13.X"`), host
+   compiler version aliases resolved (`"gcc"` → `"gcc15"`).
+5. **Set derived tags** — `std: "all"/"min"/"max"/"minmax"` resolved by
+   intersecting CTK, host compiler, device compiler, and project std sets.
+   `sm: "gpu"` replaced with the GPU's actual sm string.
+6. **Re-explode** — derived tags that produce lists are exploded again.
+7. **Apply exclusions** — `workflows.exclude` entries matched against each job;
+   partial matches trim the list values rather than dropping the whole job.
+
+### Two-stage grouping
+
+Jobs with a `needs:` dependency in `matrix.yaml["jobs"]` become two-stage
+(producer/consumer). A `build` job is the producer; `test*` jobs are consumers
+that declare `needs: build`.
+
+`finalize_workflow_dispatch_groups`:
+- Merges consumers when multiple entries resolve to the same producer.
+- Removes standalone duplicates of jobs that also appear as producers.
+- Deduplicates standalone jobs.
+- Assigns short GUIDs (base64 of incrementing 16-bit int) to every job and
+  two-stage group for compact GHA naming.
+
+### Output files (in `workflow/`)
+
+| File                  | Contents                                                                |
+|-----------------------|-------------------------------------------------------------------------|
+| `workflow.json`       | Dispatch groups: `{group_name: {standalone: [...], two_stage: [...]}}`  |
+| `dispatch.json`       | Shaped for GHA matrix: `{linux_two_stage: {keys, jobs}, ...}`           |
+| `job_ids.json`        | `{id: "group_name job_name"}` — used by result aggregation              |
+| `job_list.txt`        | Human-readable job list with IDs                                        |
+| `runner_summary.json` | Runner counts table                                                     |
+| `override.json`       | Present only when override workflow is active                           |
+| `changes.md`          | inspect_changes summary (PR mode only)                                  |
+
+### Override path
+
+`--allow-override` is passed only in PR mode. If
+`matrix_yaml["workflows"]["override"]` is non-empty, `build-workflow.py`
+substitutes the override list for the requested workflow and writes
+`workflow/override.json`. The `workflow-results` action checks for this file
+and fails the workflow, blocking merging.
+
+### Job command construction
+
+`generate_dispatch_job_command` builds the shell command from the matrix job:
+
+```
+"./ci/build_cub.sh" -std "17" -arch "80-real"
+"./ci/test_thrust.sh" -std "20" -arch "gpu"
+```
+
+Script path: `./ci/<job_prefix>_<project_id>.sh` (Linux) or
+`./ci/windows/<job_prefix>_<project_id>.ps1`.
+
+### force_producer_ctk
+
+Some consumer job types declare `force_producer_ctk` in the jobs catalogue.
+When present, the producer is built at a different CTK than the consumer runs
+at. All consumers in a two-stage group must agree on the forced CTK.
+
+### Lite workflow variant
+
+When `--lite-build-projects` is populated and a `<workflow_name>_lite` workflow
+exists in `matrix.yaml`, lite projects are pulled from that variant instead of
+the full workflow. This reduces the matrix for projects whose transitive
+dependencies changed but whose own files did not.
diff --git a/.agent/skills/cccl_detail-ci/references/tools.md b/.agent/skills/cccl_detail-ci/references/tools.md
new file mode 100644
index 00000000000..50686c48e76
--- /dev/null
+++ b/.agent/skills/cccl_detail-ci/references/tools.md
@@ -0,0 +1,15 @@
+# Tool index — cccl_detail-ci
+
+## Owned (canonical reference lives here)
+
+| Tool | Purpose | Detail |
+|------|---------|--------|
+| `ci/inspect_changes.py` | Classifies dirty CCCL subprojects between two commits or from an explicit file list. Drives job pruning in PR CI. | `references/inspect_changes_usage.md` |
+| `ci/ninja_summary.py` | Parses `.ninja_log` to produce a weighted build-time summary (elapsed / weighted by concurrency). Called in `ci/build_common.sh` on CI builds. | no dedicated usage doc; run `ci/ninja_summary.py -h` |
+| `ci/test/inspect_changes/regenerate_outputs.sh` | Regenerates expected-output baselines for `inspect_changes.py`'s test suite under `ci/test/inspect_changes/`. | run from repo root; no flags |
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `ci/build_common.sh` | Sourced by all `ci/build_*.sh` scripts; defines `build_preset`, `test_preset`, `run_ci_timed_command`, etc. | `cccl-build` → `references/build_common.sh_usage.md` |
diff --git a/.agent/skills/cccl_detail-cmake/SKILL.md b/.agent/skills/cccl_detail-cmake/SKILL.md
new file mode 100644
index 00000000000..2d9ccfd672b
--- /dev/null
+++ b/.agent/skills/cccl_detail-cmake/SKILL.md
@@ -0,0 +1,110 @@
+---
+description: |
+  CCCL CMake internals — helper modules, custom commands, architecture-flag translation,
+  metatarget dot-path system, compiler interface flags, and downstream-consumer target
+  surface. Deep reference for questions about cccl_add_executable, all-major-cccl
+  expansion, CCCL::CCCL linking, CCCLConfig, or CPM dependency wiring.
+  Triggers: "cmake internals", "cccl_add_executable", "all-major-cccl expansion",
+  "CCCL::CCCL target", "downstream consumer", "cmake helpers", "metatarget dot-path".
+---
+
+Deep-internals reference for CCCL's CMake layer. Covers every module under `cmake/`,
+the custom commands authors call in tests/examples, architecture-flag expansion, the
+metatarget dot-path naming system, and the exported target surface for downstream consumers.
+
+## Helper module index
+
+Full table in `references/cmake-module-index.md`.
+
+Quick map:
+
+| Module                     | Role                                                                                              |
+|---------------------------|---------------------------------------------------------------------------------------------------|
+| `AppendOptionIfAvailable`  | Probe-and-append a compiler flag via `check_cxx_compiler_flag`                                   |
+| `CCCLAddExecutable`        | `cccl_add_executable` — build executable with standard CCCL setup                                |
+| `CCCLAddSubdir` / `CCCLAddSubdirHelper` | Include a sub-library via its package config without `find_package` re-entry                      |
+| `CCCLBuildCompilerTargets` | Define `cccl.compiler_interface` INTERFACE target; accumulate all warning/error flags              |
+| `CCCLCheckCudaArchitectures` | Expand `all-major-cccl` / `all-cccl` pseudo-values in `CMAKE_CUDA_ARCHITECTURES`                  |
+| `CCCLClangdCompileInfo`    | Enable `compile_commands.json` and symlink it to the source root                                 |
+| `CCCLConfigureTarget`      | Apply standard properties (CXX/CUDA standard, dialect features, output dirs) to any target       |
+| `CCCLDevBuildChecks`       | Enforce CXX == CUDA standard; default to C++17 when neither is set                               |
+| `CCCLEnsureMetaTargets`    | Create dot-path custom targets (`cub`, `cub.test`, …) as umbrella build targets                  |
+| `CCCLGenerateHeaderTests`  | Generate per-header `OBJECT` libraries + link-check executables for include hygiene              |
+| `CCCLGetDependencies`      | CPM macros: `cccl_get_catch2`, `cccl_get_boost`, `cccl_get_nvbench`, etc.                        |
+| `CCCLHideThirdPartyOptions` | `mark_as_advanced` for CPM / Catch2 / LLVM noise variables                                       |
+| `CCCLInstallRules`         | `cccl_generate_install_rules` — header globs + CMake package install per sub-library             |
+| `CCCLTestParams`           | `%PARAM%`-comment scanner — cartesian-product test variant expansion                             |
+| `CCCLUtilities`            | `cccl_add_compile_test`, `cccl_add_xfail_compile_target_test`, `cccl_execute_non_fatal_process` |
+| `CCCLAddTidyTarget`        | `cccl_tidy_init` / `cccl_tidy_add_target` — per-source `clang-tidy` custom targets                |
+
+## Custom commands
+
+Full signatures in `references/custom-commands.md`.
+
+Core commands authors encounter:
+
+- `cccl_add_executable(target SOURCES … [ADD_CTEST] [NO_METATARGETS] [NO_CLANG_TIDY] [DIALECT N] [METATARGET_PATH path])` — creates an executable, calls `cccl_configure_target`, optionally registers a CTest, and hooks into the metatarget hierarchy.
+- `cccl_configure_target(target [DIALECT N])` — enforces standard, extension-off, and output directory properties. Called by all target-creating commands.
+- `cccl_add_compile_test(result_var name_prefix subdir test_id [CTEST_COMMAND …] cmake_opts…)` — registers a CTest that does a full configure-build-test of an out-of-tree CMake project (used in `examples/`).
+- `cccl_add_xfail_compile_target_test(target [ERROR_REGEX …] [SOURCE_FILE …] [ERROR_REGEX_LABEL …])` — wraps an expected-failure compile as a CTest; regex extracted from source comment annotations.
+- `cccl_generate_header_tests(target include_path [GLOBS …] [EXCLUDES …] [PER_HEADER_DEFINES …])` — generates one `.cu`/`.cpp` file per header and a link-check executable that catches non-inline function definitions.
+- `cccl_parse_variant_params(src …)` / `cccl_get_variant_data(…)` — `%PARAM%`-comment parser for test variant matrices.
+- `cccl_generate_install_rules(PROJECT_NAME enable [NO_HEADERS] [HEADERS_SUBDIRS …] [PACKAGE])` — produces `install()` rules for headers and the CMake package.
+
+## Architecture-flag translation
+
+See `references/arch-flags.md` for the full expansion table and internal call chain.
+
+`cmake/CCCLCheckCudaArchitectures.cmake` intercepts `CMAKE_CUDA_ARCHITECTURES` at configure time. Recognized pseudo-values:
+
+| Value             | Expansion rule                                                        |
+|-------------------|-----------------------------------------------------------------------|
+| `all-cccl`        | All arches nvcc reports via `--help` ≥ `minimum_cccl_arch` (currently 75) |
+| `all-major-cccl`  | Major-only subset of `all-cccl` (one per SM generation)               |
+
+Both values tag every arch `-real` and the last arch additionally `-virtual`. Standard CMake values (`native`, `all`, `all-major`, numeric, `XX-real`, `XX-virtual`) pass through unchanged.
+
+`minimum_cccl_arch` tracks the minimum architecture supported by the latest CTK; currently 75 (Turing) after CTK 13.x dropped pre-Turing.
+
+## Metatarget dot-path system
+
+`cccl_ensure_metatargets` splits a target name (or `METATARGET_PATH`) on `.` and creates a chain of `add_custom_target` umbrellas: `cudax` → `cudax.test` → `cudax.test.mytest`. This lets `ninja cudax` rebuild every cudax target and `ninja cudax.test` rebuild only tests.
+
+## Downstream-consumer surface
+
+See `references/downstream-consumers.md` for a worked `CMakeLists.txt` example.
+
+`lib/cmake/cccl/cccl-config.cmake` is the entry point. `find_package(CCCL CONFIG)` transitively finds and links libcudacxx, CUB, and Thrust. `cudax` is included only when `CCCL_ENABLE_UNSTABLE` is set.
+
+Exported targets:
+
+| Target           | Provides                                                                   |
+|------------------|---------------------------------------------------------------------------|
+| `CCCL::CCCL`     | All components as a single link target                                    |
+| `CCCL::libcudacxx` | Alias to `_libcudacxx_libcudacxx`                                          |
+| `CCCL::CUB`      | IMPORTED INTERFACE wrapping `CUB::CUB` (not alias — supports downstream export sets) |
+| `CCCL::Thrust`   | Created via `thrust_create_target` with host=CPP device=CUDA defaults     |
+| `CCCL::cudax`    | Alias to `cudax::cudax` (or `_cudax_cudax` in internal builds)             |
+
+Each sub-library also exports its own target (`libcudacxx::libcudacxx`, `CUB::CUB`, `Thrust::Thrust`, `cudax::cudax`) via per-library config files under `lib/cmake/<name>/`.
+
+## Compiler interface and build options
+
+`CCCLBuildCompilerTargets.cmake` defines `cccl.compiler_interface`, an INTERFACE target that accumulates warning flags (`-Wall`, `-Wextra`, Clang-only set, MSVC `/W4`) and definitions (`CCCL_DISABLE_EXCEPTIONS`, `CCCL_DISABLE_RTTI`, `_CCCL_NO_SYSTEM_HEADER`). Flags are applied conditionally via `$<COMPILE_LANG_AND_ID:…>` generators.
+
+MSVC workaround: `CMAKE_MSVC_DEBUG_INFORMATION_FORMAT` is forced to `Embedded`, which stores debug info in the object files instead of a separate PDB, so sccache avoids the `-Fd` flag conflict.
+
+Developer build guard: `CMAKE_CUDA_HOST_COMPILER` must match `CMAKE_CXX_COMPILER` (checked with CMake 3.31+, which provides `CMAKE_CUDA_HOST_COMPILER_ID` and `CMAKE_CUDA_HOST_COMPILER_VERSION`).
+
+## CPM and third-party dependency wiring
+
+`CCCLGetDependencies.cmake` provides `cccl_get_*` macros that gate CPM inclusion behind a guard so the `CPM.cmake` file is included only once. Key packages: Catch2 3.12.0, Boost 1.83.0, NVBench (SHA-pinned via `CCCL_NVBENCH_SHA` cache var), NVTX v3, dlpack v1.2.
+
+Sub-library packages (CUB, Thrust, libcudacxx, cudax) are located via `find_package(… NO_DEFAULT_PATH HINTS …)` pointing at `CCCL_SOURCE_DIR/lib/cmake/<name>/`; no CPM is involved for first-party dependencies.
+
+## Additional resources
+
+- `references/cmake-module-index.md` — full module inventory with file paths
+- `references/custom-commands.md` — complete command signatures and parameter tables
+- `references/arch-flags.md` — full arch-expansion table, internal call chain, minimum_cccl_arch history
+- `references/downstream-consumers.md` — exported targets table and worked consumer example
diff --git a/.agent/skills/cccl_detail-cmake/references/arch-flags.md b/.agent/skills/cccl_detail-cmake/references/arch-flags.md
new file mode 100644
index 00000000000..c16b99640e2
--- /dev/null
+++ b/.agent/skills/cccl_detail-cmake/references/arch-flags.md
@@ -0,0 +1,42 @@
+# Architecture flag translation
+
+Source: `cmake/CCCLCheckCudaArchitectures.cmake`
+
+## Recognized pseudo-values
+
+| Input value      | Expansion                                                                                   |
+|------------------|-----------------------------------------------------------------------------------------------|
+| `all-cccl`       | All arches reported by `nvcc --help` matching `compute_NNN`, filtered to ≥ `minimum_cccl_arch` |
+| `all-major-cccl` | Major-only subset of `all-cccl`: one entry per SM generation (`(arch / 10) * 10`), with the minimum clamped to `minimum_cccl_arch` |
+
+All other values (`native`, `all`, `all-major`, numeric lists, `XX-real`, `XX-virtual`) pass through CMake's standard handling unchanged.
+
+## Tag application
+
+After filtering, both pseudo-values apply `-real` to every arch and `-real` + `-virtual` to the last arch in the list. This ensures PTX is emitted for the latest SM (forward-compat JIT) while all others compile to SASS only.
+
+Example on CTK 12.9, `all-major-cccl` → `75-real;80-real;90-real;100-real;120-real;120-virtual`
+
+## Internal call chain
+
+```
+cccl_check_cuda_architectures()         ← called from top-level CMakeLists.txt
+  _cccl_detect_nvcc_arch_support()      ← runs nvcc --help, extracts compute_NNN
+  _cccl_filter_to_supported_arches()    ← removes arches < minimum_cccl_arch
+  _cccl_filter_to_all_major_cccl()      ← (all-major-cccl only) keeps one per SM gen
+  _cccl_add_real_virtual_arch_tags()    ← tags -real / -virtual, pops last element for -virtual
+```
+
+The result is written back to the `CMAKE_CUDA_ARCHITECTURES` cache variable with `FORCE`.
+
+## minimum_cccl_arch history
+
+| Value | Reason                          |
+|-------|----------------------------------|
+| 75    | CTK 13.x dropped pre-Turing support |
+
+The variable lives in `CCCLCheckCudaArchitectures.cmake` and must be bumped whenever the minimum supported CTK drops an architecture tier.
+
+## Pitfall: must call before `enable_language(CUDA)`
+
+`cccl_check_cuda_architectures()` must be called before `enable_language(CUDA)` if you want the expanded value to influence the CUDA compiler setup. Once CMake has processed the language, `CMAKE_CUDA_ARCHITECTURES` changes may not propagate to all compiler invocations.
diff --git a/.agent/skills/cccl_detail-cmake/references/custom-commands.md b/.agent/skills/cccl_detail-cmake/references/custom-commands.md
new file mode 100644
index 00000000000..d3a361e1034
--- /dev/null
+++ b/.agent/skills/cccl_detail-cmake/references/custom-commands.md
@@ -0,0 +1,109 @@
+# Custom command signatures
+
+## cccl_add_executable
+
+```cmake
+cccl_add_executable(
+  <target_name>
+  SOURCES <file> [<file>…]
+  [ADD_CTEST]             # register a no-arg CTest with the same name
+  [NO_METATARGETS]        # skip dot-path metatarget registration
+  [NO_CLANG_TIDY]         # skip clang-tidy target hookup
+  [DIALECT <std>]         # override CXX/CUDA standard (e.g. 17, 20)
+  [METATARGET_PATH <path>]# override the dot-path used for metatarget registration
+)
+```
+
+Calls `cccl_configure_target`, then optionally `add_test`, `cccl_ensure_metatargets`, `cccl_tidy_add_target`.
+
+## cccl_configure_target
+
+```cmake
+cccl_configure_target(<target> [DIALECT <std>])
+```
+
+Sets: `CXX_EXTENSIONS OFF`, `CUDA_EXTENSIONS OFF`, standard properties, `cxx_std_N`/`cuda_std_N` compile features, and output dirs (`ARCHIVE_OUTPUT_DIRECTORY`, etc.) to `${CCCL_BINARY_DIR}/lib` / `bin`. INTERFACE libraries receive `INTERFACE` compile features instead of `PUBLIC`.
+
+## cccl_add_compile_test
+
+```cmake
+cccl_add_compile_test(
+  <result_var>        # receives the generated test name
+  <name_prefix>       # e.g. cccl.example
+  <subdir>            # relative path to the sub-project directory
+  <test_id>           # unique suffix (allows same subdir, multiple ids)
+  [CTEST_COMMAND <cmd>]
+  <cmake_configure_args>…
+)
+```
+
+Registers a CTest that runs `ctest --build-and-test <src> <build> --build-generator … --build-options <args> --test-command ctest --output-on-failure`. Build directory is `${CMAKE_CURRENT_BINARY_DIR}/<subdir>/<test_id>`.
+
+## cccl_add_xfail_compile_target_test
+
+```cmake
+cccl_add_xfail_compile_target_test(
+  <target_name>
+  [TEST_NAME <name>]
+  [ERROR_REGEX <regex>]
+  [SOURCE_FILE <path>]
+  [ERROR_REGEX_LABEL <label>]          # scan SOURCE_FILE for // LABEL {{"regex"}}
+  [ERROR_NUMBER <n>]                   # match // LABEL-N {{"regex"}}
+  [ERROR_NUMBER_TARGET_NAME_REGEX <r>] # extract N from target name
+)
+```
+
+Marks the target `EXCLUDE_FROM_ALL`, adds a cleanup fixture that deletes the output file, and registers a CTest that builds the target. The test passes if (a) `PASS_REGULAR_EXPRESSION` matches the output, or (b) no regex is provided and the build fails (`WILL_FAIL true`). Source file changes re-trigger CMake via `CMAKE_CONFIGURE_DEPENDS`.
+
+## cccl_generate_header_tests
+
+```cmake
+cccl_generate_header_tests(
+  <target_name>
+  <project_include_path>  # relative to CCCL_SOURCE_DIR, e.g. libcudacxx/include
+  [DIALECT <std>]         # passed to cccl_configure_target
+  [LANGUAGE CXX|CUDA]     # default: CUDA
+  [HEADER_TEMPLATE <file>]
+  [GLOBS <pattern>…]
+  [EXCLUDES <pattern>…]
+  [HEADERS <file>…]
+  [PER_HEADER_DEFINES DEFINE <def> <regex>… [DEFINE …]]
+  [NO_METATARGETS]
+)
+```
+
+Configures `cmake/header_test.cu.in` (or custom template) for each matched header, replacing `@header@`. Builds an OBJECT library. Also creates `<target_name>.link_check` that links the objects twice — this causes a duplicate-symbol error if any header function lacks `inline`.
+
+## cccl_generate_install_rules
+
+```cmake
+cccl_generate_install_rules(
+  <PROJECT_NAME>        # case-sensitive, used for option name
+  <DEFAULT_ENABLE>      # ON/OFF default for the install option
+  [NO_HEADERS]
+  [HEADERS_SUBDIRS <dir>…]
+  [HEADERS_INCLUDE <glob>…]
+  [HEADERS_EXCLUDE <glob>…]
+  [PACKAGE]             # install the CMake package from lib/cmake/<name_lower>/
+)
+```
+
+Creates cache option `<PROJECT_NAME>_ENABLE_INSTALL_RULES`. Headers installed to `CMAKE_INSTALL_INCLUDEDIR`. Package to `CMAKE_INSTALL_LIBDIR/cmake/`. If a `<name>-header-search.cmake.in` exists in the package dir, it is configured with `_CCCL_RELATIVE_LIBDIR` and installed.
+
+## cccl_parse_variant_params / cccl_get_variant_data
+
+Source files embed `%PARAM%` comment lines:
+
+```cpp
+// %PARAM% DEFINITION_NAME label value1:value2:value3
+```
+
+`cccl_parse_variant_params(src num_var labels_var defs_var)` extracts these into cartesian-product variant lists. `cccl_get_variant_data(labels defs idx label_out defs_out)` retrieves one variant by index (also appends `VAR_IDX=N`).
+
+## cccl_ensure_metatargets
+
+```cmake
+cccl_ensure_metatargets(<target_name> [METATARGET_PATH <path>])
+```
+
+Splits `METATARGET_PATH` on `.`, creating custom targets for each prefix segment if they don't exist, chaining `add_dependencies` up the tree. The real target is made a dependency of the leaf segment.
diff --git a/.agent/skills/cccl_detail-cmake/references/downstream-consumers.md b/.agent/skills/cccl_detail-cmake/references/downstream-consumers.md
new file mode 100644
index 00000000000..e73c8bab198
--- /dev/null
+++ b/.agent/skills/cccl_detail-cmake/references/downstream-consumers.md
@@ -0,0 +1,64 @@
+# Downstream consumer surface
+
+## Entry point
+
+```cmake
+find_package(CCCL CONFIG REQUIRED
+  HINTS /path/to/cccl/lib/cmake/cccl/
+)
+target_link_libraries(my_target PRIVATE CCCL::CCCL)
+```
+
+`cccl-config.cmake` at `lib/cmake/cccl/` transitively locates and configures all enabled components via sibling `lib/cmake/<name>/` directories. No separate `find_package` calls for sub-libraries are needed when using the umbrella target.
+
+## Exported targets
+
+| Target              | Type                       | Provides                                                                                      |
+|---------------------|----------------------------|-----------------------------------------------------------------------------------------------|
+| `CCCL::CCCL`        | INTERFACE IMPORTED GLOBAL  | All components via single link                                                              |
+| `CCCL::libcudacxx`  | ALIAS → `_libcudacxx_libcudacxx` | libcudacxx headers                                                                    |
+| `CCCL::CUB`         | INTERFACE IMPORTED GLOBAL  | CUB (wraps `CUB::CUB`; IMPORTED not ALIAS to allow downstream export sets)                 |
+| `CCCL::Thrust`      | Created via `thrust_create_target` | Thrust with configurable host/device systems                                    |
+| `CCCL::cudax`       | ALIAS → `cudax::cudax`     | Experimental features (only with `CCCL_ENABLE_UNSTABLE`)                                    |
+
+Sub-library targets are also exported directly: `libcudacxx::libcudacxx`, `CUB::CUB`, `Thrust::Thrust`, `cudax::cudax`.
+
+## Thrust host/device customization
+
+When `CCCL::Thrust` is created, `thrust_create_target` reads two options:
+
+| Option                     | Default | Controls       |
+|----------------------------|---------|----------------|
+| `CCCL_THRUST_HOST_SYSTEM`  | `CPP`   | Host backend   |
+| `CCCL_THRUST_DEVICE_SYSTEM` | `CUDA`  | Device backend |
+
+Set `CCCL_ENABLE_DEFAULT_THRUST_TARGET=OFF` to suppress `CCCL::Thrust` creation and call `thrust_create_target` manually.
+
+## Component selection
+
+```cmake
+find_package(CCCL CONFIG REQUIRED COMPONENTS libcudacxx CUB)
+```
+
+Only the requested components are configured. Omitting `COMPONENTS` enables all three (Thrust, CUB, libcudacxx). `cudax` is never included automatically — it requires both an explicit `COMPONENTS cudax` and `CCCL_ENABLE_UNSTABLE=ON`.
+
+## Internal vs. external find_package
+
+Within the CCCL source tree, sub-libraries are never brought in via `find_package` at the CMake top level. `cccl_add_subdir_helper` directly `include()`s the package config files from `CCCL_SOURCE_DIR/lib/cmake/<name>/`, bypassing the `find_package` machinery to avoid re-configuration inconsistencies under CPM. For details, see the comment in `cmake/CCCLAddSubdirHelper.cmake` that references NVIDIA/libcudacxx#242.
+
+## Install layout
+
+After `cmake --install`:
+
+```
+<prefix>/
+  include/                 ← all sub-library headers
+  lib/cmake/
+    cccl/                  ← cccl-config.cmake, cccl-config-version.cmake
+    libcudacxx/            ← libcudacxx-config.cmake, libcudacxx-header-search.cmake
+    cub/
+    thrust/
+    cudax/
+```
+
+Each `*-header-search.cmake` is a configured file that records the relative path from the cmake package dir to the include dir, allowing the package to locate its headers without hardcoded paths.
diff --git a/.agent/skills/cccl_detail-cpp-macros/SKILL.md b/.agent/skills/cccl_detail-cpp-macros/SKILL.md
new file mode 100644
index 00000000000..145b5d1fafd
--- /dev/null
+++ b/.agent/skills/cccl_detail-cpp-macros/SKILL.md
@@ -0,0 +1,137 @@
+---
+description: |
+  CCCL `_CCCL_*` internal macro catalog — compiler detection, CUDA compiler
+  and version queries, C++ dialect, execution-space qualifiers, visibility/ABI,
+  diagnostics push/pop, and portability shims. Used by anyone authoring CCCL
+  headers across libcudacxx, CUB, Thrust, and cudax.
+  Triggers: "_CCCL_ macro", "_CCCL_HOST_DEVICE", "_CCCL_API", "compiler detection macro",
+  "visibility macro", "diagnostic suppression", "_CCCL_STD_VER".
+---
+
+All `_CCCL_*` macros are defined under
+`libcudacxx/include/cuda/std/__cccl/`. Every header there is included
+transitively by `<cuda/__cccl_config>`. The macros are shared across the
+entire CCCL family — they are not libcudacxx-private.
+
+## Header map
+
+| Header               | Contents                                                                                     |
+|----------------------|-------------------------------------------------------------------------------------------------|
+| `compiler.h`         | Host compiler detection, CUDA compiler detection, compilation-phase predicates, `_CCCL_PRAGMA` |
+| `dialect.h`          | `_CCCL_STD_VER`, dialect-gated `constexpr`/`consteval`, feature predicates                    |
+| `execution_space.h`  | `_CCCL_HOST`, `_CCCL_DEVICE`, `_CCCL_HOST_DEVICE`, `_CCCL_GRID_CONSTANT`                     |
+| `visibility.h`       | `_CCCL_API`, `_CCCL_HIDE_FROM_ABI`, `_CCCL_KERNEL_ATTRIBUTES` and variants                   |
+| `attributes.h`       | `_CCCL_NODEBUG`, `_CCCL_ARTIFICIAL`, `_CCCL_PURE`, `_CCCL_ASSUME`, `_CCCL_NO_UNIQUE_ADDRESS`, etc. |
+| `diagnostic.h`       | `_CCCL_DIAG_PUSH/POP`, per-compiler `_CCCL_DIAG_SUPPRESS_*`, `_CCCL_SUPPRESS_DEPRECATED_PUSH/POP` |
+| `cuda_capabilities.h` | `_CCCL_PTX_ARCH`, RDC/EWP/CDP/PDL feature predicates, `_CCCL_LAUNCH_BOUNDS`                  |
+| `deprecated.h`       | `CCCL_DEPRECATED`, `CCCL_DEPRECATED_BECAUSE`, dialect-gated deprecation shims               |
+
+## Compiler detection
+
+`_CCCL_COMPILER(ID)` and `_CCCL_COMPILER(ID, OP, MAJOR[, MINOR])` query the
+host compiler. `_CCCL_CUDA_COMPILER(ID[, ...])` queries the CUDA compiler.
+Both use a versioned dispatch pattern — each ID function-macro returns a
+`(major, minor)` pair or `_CCCL_VERSION_INVALID()`.
+
+Supported host IDs: `GCC`, `CLANG`, `MSVC`, `MSVC2019`, `MSVC2022`,
+`MSVC2026`, `NVHPC`, `NVRTC`.
+
+Supported CUDA compiler IDs: `NVCC`, `NVHPC`, `CLANG`, `NVRTC`.
+
+Compilation-phase predicates:
+
+| Macro                       | True when                                       |
+|------------------------------|--------------------------------------------------|
+| `_CCCL_CUDA_COMPILATION()`  | Compiling a `.cu` file                          |
+| `_CCCL_HOST_COMPILATION()`  | `__CUDA_ARCH__` not defined                     |
+| `_CCCL_DEVICE_COMPILATION()` | CUDA device pass active                         |
+| `_CCCL_CUDACC()`            | Returns `(major, minor)` of the active CUDA toolkit |
+| `_CCCL_CUDACC_AT_LEAST(M[, N])` | CUDA toolkit ≥ M.N                          |
+| `_CCCL_CUDACC_BELOW(M[, N])` | CUDA toolkit < M.N                             |
+
+See `references/compiler-detection.md` for usage patterns and version
+comparison idioms.
+
+## C++ dialect
+
+`_CCCL_STD_VER` — integer year (2011, 2014, 2017, 2020, 2023, 2024).
+Compare with `>=` / `<=`.
+
+Dialect-gated `constexpr`: `_CCCL_CONSTEXPR_CXX20`, `_CCCL_CONSTEXPR_CXX23`.
+Use these instead of raw `constexpr` when the code must still compile as C++17.
+
+Feature predicates (return 0/1): `_CCCL_HAS_CONCEPTS()`,
+`_CCCL_HAS_PACK_INDEXING()`, `_CCCL_HAS_CHAR8_T()`,
+`_CCCL_HAS_LONG_DOUBLE()`, `_CCCL_HAS_MULTIARG_OPERATOR_BRACKETS()`.
+
+## Execution-space qualifiers
+
+| Macro                   | Expands to                                                 |
+|-------------------------|-------------------------------------------------------------|
+| `_CCCL_HOST_DEVICE`     | `__host__ __device__` in CUDA builds; empty otherwise      |
+| `_CCCL_HOST`            | `__host__` in CUDA builds                                  |
+| `_CCCL_DEVICE`          | `__device__` in CUDA builds                                |
+| `_CCCL_GRID_CONSTANT`   | `__grid_constant__` on supported toolchains (sm_70+, CUDA ≥ 12.8) |
+| `_CCCL_EXEC_CHECK_DISABLE` | Disables NVCC exec-space-check for a function             |
+| `_CCCL_LAUNCH_BOUNDS(...)` | `__launch_bounds__(...)` unless RDC is active             |
+
+Always use `_CCCL_HOST_DEVICE` for any function callable from both host and
+device contexts. Never use raw `__host__ __device__` in CCCL headers.
+
+## Visibility and ABI
+
+Use the function-qualifier macros, not the raw visibility macros, for all
+function declarations.
+
+| Macro                   | Use for                                                  |
+|-------------------------|----------------------------------------------------------|
+| `_CCCL_API`             | Standard internal `__host__ __device__` function — hidden from ABI |
+| `_CCCL_HOST_API`        | Host-only variant                                        |
+| `_CCCL_DEVICE_API`      | Device-only variant                                      |
+| `_CCCL_NODEBUG_API`     | Same as `_CCCL_API` + debugger-skip (`inline` implied)   |
+| `_CCCL_TRIVIAL_API`     | Same as `_CCCL_NODEBUG_API` + force-inline; for dispatch/CPO glue |
+| `_CCCL_PUBLIC_API`      | Visible across DSO boundary — use when address appears in a public type |
+| `_CCCL_KERNEL_ATTRIBUTES` | `__global__ _CCCL_VISIBILITY_HIDDEN` for kernel definitions |
+| `_CCCL_HIDE_FROM_ABI`   | Legacy; prefer `_CCCL_API` for new code                  |
+| `_CCCL_FORCEINLINE`     | Force-inline without ABI hiding                          |
+
+`_LIBCUDACXX_HIDE_FROM_ABI` is a compatibility alias for external projects;
+do not use it in new CCCL code.
+
+See `references/visibility-abi.md` for the full attribute composition and
+NVHPC workaround notes.
+
+## Diagnostics
+
+Always bracket suppression with push/pop — never suppress without restoring.
+
+Host-compiler diagnostics:
+
+```cpp
+_CCCL_DIAG_PUSH
+_CCCL_DIAG_SUPPRESS_CLANG("-Wshadow")   // no-op on other compilers
+_CCCL_DIAG_SUPPRESS_GCC("-Wshadow")
+_CCCL_DIAG_SUPPRESS_MSVC(4267)
+// ... code ...
+_CCCL_DIAG_POP
+```
+
+NVCC/NVRTC diagnostics (numeric codes):
+
+```cpp
+_CCCL_BEGIN_NV_DIAG_SUPPRESS(20012, 20013)
+// ... code ...
+_CCCL_END_NV_DIAG_SUPPRESS()
+```
+
+Compound shortcut: `_CCCL_SUPPRESS_DEPRECATED_PUSH` / `_CCCL_SUPPRESS_DEPRECATED_POP`
+suppress deprecation warnings on all supported compilers at once.
+
+See `references/diagnostics.md` for the full suppress-macro table and
+common warning codes.
+
+## Additional resources
+
+- `references/compiler-detection.md` — version comparison idioms, full ID list, freestanding/NVRTC notes
+- `references/visibility-abi.md` — attribute composition matrix, `_CCCL_API` vs `_CCCL_TRIVIAL_API` decision, NVHPC quirks
+- `references/diagnostics.md` — per-compiler suppress macros, common warning codes, push/pop patterns
diff --git a/.agent/skills/cccl_detail-cpp-macros/references/compiler-detection.md b/.agent/skills/cccl_detail-cpp-macros/references/compiler-detection.md
new file mode 100644
index 00000000000..3563d04a375
--- /dev/null
+++ b/.agent/skills/cccl_detail-cpp-macros/references/compiler-detection.md
@@ -0,0 +1,97 @@
+# Compiler detection reference
+
+## `_CCCL_COMPILER` and `_CCCL_CUDA_COMPILER` usage
+
+Both macros use the same versioned dispatch pattern:
+
+```cpp
+_CCCL_COMPILER(ID)                     // true if host compiler is ID
+_CCCL_COMPILER(ID, OP, MAJOR)          // compare major version
+_CCCL_COMPILER(ID, OP, MAJOR, MINOR)   // compare major.minor version
+
+_CCCL_CUDA_COMPILER(ID)                // true if CUDA compiler is ID
+_CCCL_CUDA_COMPILER(ID, OP, MAJOR)
+_CCCL_CUDA_COMPILER(ID, OP, MAJOR, MINOR)
+```
+
+`OP` is a C comparison operator: `<`, `<=`, `==`, `>=`, `>`.
+
+Examples:
+
+```cpp
+#if _CCCL_COMPILER(GCC, >=, 12)
+// GCC 12+
+#endif
+
+#if _CCCL_CUDA_COMPILER(NVCC, <, 12, 8)
+// NVCC older than 12.8
+#endif
+
+#if _CCCL_COMPILER(CLANG) && _CCCL_CUDA_COMPILER(CLANG)
+// clang-cuda
+#endif
+```
+
+## Host compiler IDs
+
+| ID          | Detected by             |
+|-------------|-------------------------|
+| `GCC`       | `__GNUC__` (not clang, not icc) |
+| `CLANG`     | `__clang__`             |
+| `MSVC`      | `_MSC_VER`              |
+| `MSVC2019`  | MSVC 19.20–19.29        |
+| `MSVC2022`  | MSVC 19.30–19.49        |
+| `MSVC2026`  | MSVC 19.50+             |
+| `NVHPC`     | `__NVCOMPILER`          |
+| `NVRTC`     | `__CUDACC_RTC__`        |
+
+Intel Classic (`icc`/`icpc`) is not supported — CCCL emits a warning.
+NVRTC is treated as freestanding (no host stdlib).
+
+## CUDA compiler IDs
+
+| ID      | Detected by                                |
+|---------|--------------------------------------------|
+| `NVCC`  | `__NVCC__` (inside `.cu` compile)          |
+| `NVHPC` | `_NVHPC_CUDA`                              |
+| `CLANG` | clang `__CUDA__` + `_CCCL_COMPILER(CLANG)` |
+| `NVRTC` | `_CCCL_COMPILER(NVRTC)` (same compiler)    |
+
+## CUDA toolkit version (`_CCCL_CUDACC`)
+
+`_CCCL_CUDACC()` returns `(major, minor)` of the active CUDA toolkit, or
+`_CCCL_VERSION_INVALID()` when not in a CUDA compilation.
+
+Use the shorthand predicates:
+
+```cpp
+_CCCL_CUDACC_AT_LEAST(12)        // toolkit >= 12.0
+_CCCL_CUDACC_AT_LEAST(12, 5)     // toolkit >= 12.5
+_CCCL_CUDACC_BELOW(13)           // toolkit < 13.0
+_CCCL_CUDACC_EQUAL(12, 8)        // toolkit == 12.8
+```
+
+## Freestanding / NVRTC notes
+
+`_CCCL_FREESTANDING()` is 1 when `_CCCL_ENABLE_FREESTANDING` is defined or
+the compiler is NVRTC. `_CCCL_HOSTED()` is its complement.
+`_CCCL_HOSTJIT()` is 1 in freestanding non-NVRTC contexts (GPU JIT with
+host-stdlib access).
+
+## Compilation-phase predicates
+
+`_CCCL_CUDA_COMPILATION()`, `_CCCL_HOST_COMPILATION()`, and
+`_CCCL_DEVICE_COMPILATION()` test the current compilation pass, not the
+compiler. They are not mutually exclusive: on the host pass of a `.cu` file,
+`_CCCL_CUDA_COMPILATION()` and `_CCCL_HOST_COMPILATION()` can both be 1.
+
+## `_CCCL_PRAGMA`
+
+Portable pragma emission:
+
+```cpp
+_CCCL_PRAGMA(unroll 4)               // emits #pragma unroll 4 (or __pragma on MSVC)
+_CCCL_PRAGMA_UNROLL(N)               // portable loop-unroll hint
+_CCCL_PRAGMA_UNROLL_FULL()           // unroll all iterations
+_CCCL_PRAGMA_NOUNROLL()              // prevent unrolling
+```
diff --git a/.agent/skills/cccl_detail-cpp-macros/references/diagnostics.md b/.agent/skills/cccl_detail-cpp-macros/references/diagnostics.md
new file mode 100644
index 00000000000..7e00dd5f2a0
--- /dev/null
+++ b/.agent/skills/cccl_detail-cpp-macros/references/diagnostics.md
@@ -0,0 +1,94 @@
+# Diagnostics reference
+
+## Push/pop pattern
+
+Always pair push with pop. Nesting is allowed.
+
+```cpp
+_CCCL_DIAG_PUSH
+_CCCL_DIAG_SUPPRESS_GCC("-Wunused-parameter")
+_CCCL_DIAG_SUPPRESS_CLANG("-Wunused-parameter")
+_CCCL_DIAG_SUPPRESS_MSVC(4100)
+// ... suppressed region ...
+_CCCL_DIAG_POP
+```
+
+Per-compiler suppression macros are no-ops on other compilers — include all
+relevant ones in the same block.
+
+## Host-compiler suppress macros
+
+| Macro                          | Active compiler | Argument                                   |
+|--------------------------------|-----------------|-------------------------------------------|
+| `_CCCL_DIAG_SUPPRESS_CLANG(W)` | Clang           | quoted warning flag, e.g. `"-Wshadow"`     |
+| `_CCCL_DIAG_SUPPRESS_GCC(W)`   | GCC             | quoted warning flag, e.g. `"-Wdeprecated"` |
+| `_CCCL_DIAG_SUPPRESS_NVHPC(W)` | NVHPC           | diagnostic name, e.g. `deprecated_entity`  |
+| `_CCCL_DIAG_SUPPRESS_MSVC(C)`  | MSVC            | numeric code, e.g. `4996`                  |
+
+## NVCC/NVRTC suppress macros
+
+Use numeric diagnostic codes. Multiple codes accepted.
+
+```cpp
+_CCCL_BEGIN_NV_DIAG_SUPPRESS(20012)                 // one code per call, or
+// _CCCL_BEGIN_NV_DIAG_SUPPRESS(20012, 20013, 1444) // several codes in one call
+// ... suppressed region ...
+_CCCL_END_NV_DIAG_SUPPRESS()
+```
+
+`_CCCL_NV_DIAG_PUSH()` / `_CCCL_NV_DIAG_POP()` are the lower-level forms
+used when you need NVCC push/pop without immediately suppressing.
+`_CCCL_DIAG_SUPPRESS_NVCC(N)` suppresses a single code without push/pop.
+
+## Compound shortcuts
+
+`_CCCL_SUPPRESS_DEPRECATED_PUSH` / `_CCCL_SUPPRESS_DEPRECATED_POP` cover
+all compilers with a single push/pop pair. They suppress:
+- Clang: `-Wdeprecated`, `-Wdeprecated-declarations`
+- GCC: `-Wdeprecated`, `-Wdeprecated-declarations`
+- NVHPC: `deprecated_entity`, `deprecated_entity_with_custom_message`
+- MSVC: C4996
+- NVCC: 1444, 20199
+
+Use this to call deprecated CCCL APIs internally without leaking warnings:
+
+```cpp
+_CCCL_SUPPRESS_DEPRECATED_PUSH
+old_function();
+_CCCL_SUPPRESS_DEPRECATED_POP
+```
+
+## Common NVCC numeric codes
+
+| Code  | Meaning                                         |
+|-------|------------------------------------------------|
+| 1444  | deprecated entity                               |
+| 1675  | unrecognized `#pragma`                          |
+| 20012 | `__host__` annotation on `__device__`-only function |
+| 20013 | `__device__` annotation on `__host__`-only function |
+| 20199 | deprecated API                                  |
+
+## `_CCCL_WARNING`
+
+Emits a portable compiler warning at the call site:
+
+```cpp
+_CCCL_WARNING("this overload is slow on MSVC")
+```
+
+Expands to `#pragma message` on MSVC, `#pragma GCC warning` elsewhere.
+
+## Deprecation macros
+
+Defined in `deprecated.h`, not `diagnostic.h`:
+
+| Macro                         | Use                                  |
+|-------------------------------|--------------------------------------|
+| `CCCL_DEPRECATED`             | Mark a function or type deprecated   |
+| `CCCL_DEPRECATED_BECAUSE(MSG)` | Deprecated with a custom message     |
+| `_CCCL_DEPRECATED_IN_CXX20`   | Active only when `_CCCL_STD_VER >= 2020` |
+| `_CCCL_DEPRECATED_IN_CXX23`   | Active only when `_CCCL_STD_VER >= 2023` |
+
+Opt-out defines (user-facing): `CCCL_IGNORE_DEPRECATED_API`,
+`CCCL_IGNORE_DEPRECATED_COMPILER`, `CCCL_IGNORE_DEPRECATED_CPP_DIALECT`.
+These are public contract — do not remove.
diff --git a/.agent/skills/cccl_detail-cpp-macros/references/visibility-abi.md b/.agent/skills/cccl_detail-cpp-macros/references/visibility-abi.md
new file mode 100644
index 00000000000..af84bd53c62
--- /dev/null
+++ b/.agent/skills/cccl_detail-cpp-macros/references/visibility-abi.md
@@ -0,0 +1,74 @@
+# Visibility and ABI reference
+
+## Function-qualifier macro matrix
+
+| Macro                      | Execution space       | Visibility | `inline` | Force-inline | Debugger-skip |
+|----------------------------|-----------------------|------------|----------|--------------|---------------|
+| `_CCCL_API`                | `__host__ __device__` | hidden     | no       | no           | no            |
+| `_CCCL_HOST_API`           | `__host__`            | hidden     | no       | no           | no            |
+| `_CCCL_DEVICE_API`         | `__device__`          | hidden     | no       | no           | no            |
+| `_CCCL_NODEBUG_API`        | `__host__ __device__` | hidden     | yes      | no           | yes           |
+| `_CCCL_NODEBUG_HOST_API`   | `__host__`            | hidden     | yes      | no           | yes           |
+| `_CCCL_NODEBUG_DEVICE_API` | `__device__`          | hidden     | yes      | no           | yes           |
+| `_CCCL_TRIVIAL_API`        | `__host__ __device__` | hidden     | yes      | yes          | yes           |
+| `_CCCL_TRIVIAL_HOST_API`   | `__host__`            | hidden     | yes      | yes          | yes           |
+| `_CCCL_TRIVIAL_DEVICE_API` | `__device__`          | hidden     | yes      | yes          | yes           |
+| `_CCCL_PUBLIC_API`         | `__host__ __device__` | default    | no       | no           | no            |
+| `_CCCL_PUBLIC_HOST_API`    | `__host__`            | default    | no       | no           | no            |
+| `_CCCL_PUBLIC_DEVICE_API`  | `__device__`          | default    | no       | no           | no            |
+
+## Decision guide
+
+- **Default choice for internal helpers:** `_CCCL_API`.
+- **Trivial dispatchers / CPO glue:** `_CCCL_TRIVIAL_API`. These are always
+  force-inlined and debuggers step through them transparently.
+- **"Step-over" utilities** (`cuda::std::move`, `cuda::std::forward`): `_CCCL_NODEBUG_API`.
+  Debuggers show the frame in a stacktrace but won't let you set the active
+  frame to it.
+- **Function whose address appears in a public type:** `_CCCL_PUBLIC_API`.
+  GCC warns if a `hidden` function's address is embedded in a `default`-visibility type.
+- **CDP-callable functions:** `_CCCL_CDP_API` — expands to `_CCCL_API` when CDP
+  is available, `_CCCL_HOST_API` otherwise.
+
+`_CCCL_API` and `_CCCL_FORCEINLINE` cannot be combined — `_CCCL_API` already
+implies `inline` through `_CCCL_HIDE_FROM_ABI`. Use `_CCCL_TRIVIAL_API` instead.
+
+## Raw visibility macros (avoid in new code)
+
+| Macro                           | Expands to                                                        |
+|---------------------------------|-------------------------------------------------------------------|
+| `_CCCL_VISIBILITY_HIDDEN`       | `__attribute__((__visibility__("hidden")))`                       |
+| `_CCCL_VISIBILITY_DEFAULT`      | `__attribute__((__visibility__("default")))`                      |
+| `_CCCL_VISIBILITY_EXPORT`       | `__declspec(dllexport)` on MSVC; `_CCCL_VISIBILITY_DEFAULT` elsewhere |
+| `_CCCL_TYPE_VISIBILITY_HIDDEN`  | Hidden type visibility (`__type_visibility__` if available)       |
+| `_CCCL_TYPE_VISIBILITY_DEFAULT` | Default type visibility                                           |
+| `_CCCL_FORCEINLINE`             | `__forceinline` (MSVC) / `__inline__ __attribute__((__always_inline__))` |
+| `_CCCL_FORCEINLINE_LAMBDA`      | `__attribute__((__always_inline__))` on lambdas (non-MSVC)        |
+| `_CCCL_EXCLUDE_FROM_EXPLICIT_INSTANTIATION` | `__attribute__((__exclude_from_explicit_instantiation__))` if available |
+| `_CCCL_HIDE_FROM_ABI`           | `_CCCL_VISIBILITY_HIDDEN _CCCL_EXCLUDE_FROM_EXPLICIT_INSTANTIATION inline` |
+
+## NVHPC workaround
+
+NVHPC has issues with visibility attributes on symbols with internal linkage.
+All visibility macros degrade gracefully on NVHPC — `_CCCL_API` expands to
+`_CCCL_HOST_DEVICE` alone.
+
+## Kernel attributes
+
+```cpp
+_CCCL_KERNEL_ATTRIBUTES           // __global__ _CCCL_VISIBILITY_HIDDEN
+_CCCL_LAUNCH_BOUNDS(N)            // __launch_bounds__(N) unless RDC active
+_CCCL_LAUNCH_BOUNDS(N, M)         // __launch_bounds__(N, M) unless RDC active
+_CCCL_GRID_CONSTANT               // __grid_constant__ for const kernel params (sm_70+, CUDA 12.8+)
+_CCCL_BLOCK_SIZE(NTID, NCTA)      // __block_size__ / __cluster_dims__ on Hopper+ (CUDA 12.9+)
+```
+
+`_CCCL_KERNEL_ATTRIBUTES` can be redefined by users until CCCL 4.0 (for
+backwards compat with `CUB_DETAIL_KERNEL_ATTRIBUTES`). Use
+`_CCCL_KERNEL_ATTRIBUTES` on all kernel definitions — never raw `__global__`.
+
+## `_CCCL_GLOBAL_VARIABLE`
+
+Marks file-scope variables accessible from device code. Expands to `__device__`
+during device compilation (except NVHPC), empty otherwise. Required for
+non-builtin-type global variables referenced in device code.
diff --git a/.agent/skills/cccl_detail-devcontainer-matrix/SKILL.md b/.agent/skills/cccl_detail-devcontainer-matrix/SKILL.md
new file mode 100644
index 00000000000..2d26887d2ff
--- /dev/null
+++ b/.agent/skills/cccl_detail-devcontainer-matrix/SKILL.md
@@ -0,0 +1,75 @@
+---
+description: |
+  CCCL devcontainer matrix expansion — `make_devcontainers.sh`, 60+ generated
+  configs from `ci/matrix.yaml`'s `devcontainers:` section, the verify-devcontainers
+  CI workflow, and `[skip-vdc]` policy. Reference for understanding the generation
+  pipeline, naming convention, and when regeneration is required.
+  Triggers: "make_devcontainers.sh", "regenerate devcontainers", "skip-vdc",
+  "devcontainer matrix expansion", "verify-devcontainers".
+---
+
+The generated `.devcontainer/` subdirs are produced from `ci/matrix.yaml` and are not
+hand-edited. This skill covers what drives them, what the CI check enforces, and when
+`[skip-vdc]` is safe to apply.
+
+## Generation pipeline
+
+`ci/matrix.yaml` → `make_devcontainers.sh` → `.devcontainer/cuda<CTK>-<compiler>/devcontainer.json`
+
+The `devcontainers:` section of `ci/matrix.yaml` lists `dc` and `dc_ext` job entries that
+enumerate the CTK × host-compiler combinations requiring devcontainers. These entries have
+no corresponding build/test scripts — they exist solely to drive container generation.
+
+`make_devcontainers.sh` calls `.github/actions/workflow-build/build-workflow.py` with
+`--devcontainer-info` to resolve aliases (`12.X`, `13.X`) and emit a JSON combination
+list, then renders a `devcontainer.json` for each entry.
+
+## Naming convention
+
+`.devcontainer/cuda<CTK>[ext]-<compiler><version>/`
+
+| Suffix     | When                                                    |
+|------------|-------------------------------------------------------|
+| _(none)_  | Standard CTK image                                      |
+| `ext`     | Extended CTK image (`dc_ext` job type; extra CUDA libraries) |
+
+Examples: `cuda13.2-gcc14/`, `cuda12.9ext-llvm20/`, `cuda13.0-nvhpc25.11/`.
+
+The four `cuda99.8` and `cuda99.9` entries are internal NVIDIA images (compiler versions
+pulled from `cuda99_gcc_version` / `cuda99_clang_version` in `matrix.yaml`). They are
+generated but excluded from the `verify-devcontainers` test matrix.
+
+## Template
+
+`.devcontainer/devcontainer.json` is both the base template and the default container
+(updated in-place to the newest GCC × highest-CTK combination on each regeneration run).
+The per-combination `devcontainer.json` files inherit its structure; `make_devcontainers.sh`
+stamps each with the correct image, name, and `CCCL_*` env vars via `jq`.
+
+Direct edits to per-combination `devcontainer.json` files are overwritten on the next
+regeneration run. Edit the base template or `ci/matrix.yaml`, then regenerate.
+
+## verify-devcontainers workflow
+
+`.github/workflows/verify-devcontainers.yml` runs `make_devcontainers.sh --verbose --clean`
+and asserts the working tree is clean. A dirty tree means the committed files are out of
+sync with what the matrix produces. It then fans out across all non-`cuda99` containers and
+runs `.devcontainer/verify_devcontainer.sh` inside each one.
+
+The workflow fires when a PR modifies `.devcontainer/`, `ci/matrix.yaml`, or
+`.github/actions/workflow-build/build-workflow.py`.
+
+## `[skip-vdc]` policy
+
+`[skip-vdc]` in the last commit message disables the verify-devcontainers jobs for that PR run.
+
+Safe to apply when: the PR touches none of `.devcontainer/`, `ci/matrix.yaml`, or
+`build-workflow.py`, and the devcontainer state is known-good.
+
+Not safe to apply on PRs that modify any of those three paths — a dirty-tree failure
+will surface on merge and block the branch.
+
+## Additional resources
+
+- `cccl-devcontainer` — step-by-step regen walkthrough and validation (its `references/regenerate.md`).
+- `references/tools.md` — `make_devcontainers.sh` ownership and usage cross-reference.
diff --git a/.agent/skills/cccl_detail-devcontainer-matrix/references/tools.md b/.agent/skills/cccl_detail-devcontainer-matrix/references/tools.md
new file mode 100644
index 00000000000..9a942335c3f
--- /dev/null
+++ b/.agent/skills/cccl_detail-devcontainer-matrix/references/tools.md
@@ -0,0 +1,13 @@
+# Tool index — cccl_detail-devcontainer-matrix
+
+## Owned (canonical reference lives here)
+
+| Tool | Purpose | Detail |
+|------|---------|--------|
+| `.devcontainer/make_devcontainers.sh` | Generates all `.devcontainer/cuda<CTK>[ext]-<compiler><version>/` subdirectories from the `devcontainers:` section of `ci/matrix.yaml`. Must be re-run whenever devcontainer entries in `ci/matrix.yaml` are added, changed, or removed. | see `cccl-devcontainer` → `references/regenerate.md` for the step-by-step regen workflow |
+
+## Used (canonical reference lives in another skill)
+
+| Tool | Purpose | Reference |
+|------|---------|-----------|
+| `.devcontainer/launch.sh` | Used to verify generated containers build and launch correctly after regeneration. | `cccl-devcontainer` → `references/launch_usage.md` |
diff --git a/.agent/skills/cccl_detail-examples/SKILL.md b/.agent/skills/cccl_detail-examples/SKILL.md
new file mode 100644
index 00000000000..7eb6de0c525
--- /dev/null
+++ b/.agent/skills/cccl_detail-examples/SKILL.md
@@ -0,0 +1,93 @@
+---
+description: |
+  Top-level `examples/` directory — CPM-consumption tests that verify CCCL can be
+  fetched via CPM and built downstream. Covers `cccl_add_compile_test` signature,
+  how each subdirectory is a self-contained CMake project, differences from
+  per-library examples, the `packaging` CI preset, and authoring a new example.
+  Triggers: "how do the top-level examples work", "add a new example", "what is
+  cccl_add_compile_test", "examples/ CPM test".
+---
+
+The top-level `examples/` directory is not a unit-test suite — it is a set of
+downstream-consumer validation builds. Each subdirectory is an independent CMake
+project that fetches CCCL via CPM and builds against it, proving that a real user
+can depend on CCCL from GitHub.
+
+## What the directory contains
+
+| Subdirectory                | What it exercises                                       |
+|-----------------------------|-------------------------------------------------------|
+| `basic/`                    | Bare `CCCL::CCCL` link target; serves as the canonical starter template |
+| `ccclrt/`                   | `ccclrt` kernel launch patterns                         |
+| `cudax/`                    | `CCCL::cudax` (requires `CCCL_ENABLE_UNSTABLE ON`)      |
+| `cudax_stf/`                | STF APIs                                                |
+| `thrust_flexible_device_system/` | Thrust with configurable device system (CUDA/OMP/TBB/CPP) |
+
+Each subdirectory carries its own `cmake/CPM.cmake`, `CMakeLists.txt`, and at
+least one source file. They have no dependency on the CCCL build system and
+compile cleanly as standalone projects.
+
+## How `cccl_add_compile_test` works
+
+Defined in `cmake/CCCLUtilities.cmake`. Signature:
+
+```cmake
+cccl_add_compile_test(
+  <output_test_name_var>
+  <name_prefix>       # e.g. cccl.example
+  <subdir>            # relative path to the standalone project
+  <test_id>           # disambiguates multiple configs for the same subdir
+  [CTEST_COMMAND <cmd>]
+  [<additional cmake -D options>...]
+)
+```
+
+The function registers a CTest test whose command is `ctest --build-and-test`,
+which configures, builds, and runs the subdirectory's own CTest suite in one
+step. The resulting test name is `<name_prefix>.<subdir>.<test_id>`.
+
+The top-level `examples/CMakeLists.txt` passes two CPM-overriding variables to
+every invocation:
+
+- `-DCCCL_REPOSITORY` — local repo path (overrides the public GitHub URL during CI)
+- `-DCCCL_TAG` — `GITHUB_SHA` in CI, `HEAD` locally
+
+This lets CI validate the current PR's tree without pushing to GitHub first.
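+
+A registration in `examples/CMakeLists.txt` might look like the sketch below. It
+only illustrates the signature above: `my_example`, the `default` test id, and the
+use of the `CCCL_EXAMPLE_CPM_*` variables as the source of the two overrides are
+assumptions, not copied from the repository.
+
+```cmake
+# Hypothetical registration for examples/my_example/ (names are illustrative).
+cccl_add_compile_test(
+  test_name        # output variable that receives the generated CTest name
+  cccl.example     # name prefix
+  my_example       # subdirectory holding the standalone project
+  default          # test id, distinguishes multiple configs of one subdir
+  "-DCCCL_REPOSITORY=${CCCL_EXAMPLE_CPM_REPOSITORY}"  # assumed source of the local repo path
+  "-DCCCL_TAG=${CCCL_EXAMPLE_CPM_TAG}"                # assumed source of the tag / SHA
+)
+# Per the naming rule above, this would register cccl.example.my_example.default.
+```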
+
+## Per-library examples vs top-level `examples/`
+
+Per-library examples (e.g., `cub/examples/`, `thrust/examples/`) live inside their
+library's source tree, link against the in-tree build, and are controlled by each
+library's own CMake options. They test the in-tree build.
+
+Top-level `examples/` tests the *packaging and export* surface: does `CPMAddPackage`
+produce a usable `CCCL::CCCL` target for a downstream CMake project?
+
+## CI integration
+
+Examples run under the `packaging` CMake preset (`CCCL_ENABLE_EXAMPLES=true`,
+`CCCL_ENABLE_TESTING=true`). The driver script is `ci/test_packaging.sh`, which:
+
+1. Sets `CCCL_EXAMPLE_CPM_REPOSITORY` to the local repo root.
+2. Sets `CCCL_EXAMPLE_CPM_TAG` to `GITHUB_SHA` (or `HEAD` locally).
+3. Configures and builds the `packaging` preset.
+4. Runs `ctest`.
+
+In `ci/matrix.yaml`, `project: packaging` entries appear in `pull_request`,
+`pull_request_lite`, and `nightly` workflows, covering CTK 12.0–13.X with gcc
+and clang. A `-min-cmake` variant tests against CMake 3.18 (the minimum required).
+GPU is required — the examples run device kernels via CTest.
+
+## Authoring a new example
+
+1. Create `examples/<your-example>/` with:
+   - `cmake/CPM.cmake` — copy from any existing subdirectory.
+   - `CMakeLists.txt` — `project(...)`, `include(cmake/CPM.cmake)`, `CPMAddPackage(NAME CCCL ...)`, add target, `include(CTest)`, `add_test(...)`.
+   - Source file(s).
+2. Register in `examples/CMakeLists.txt` with one or more `cccl_add_compile_test` calls.
+3. If the example needs a non-default Thrust device system or other per-config variation, add a `foreach` loop (see `thrust_flexible_device_system` for the pattern).
+4. Verify locally: `cmake -S . -B build -DCCCL_ENABLE_EXAMPLES=ON && cmake --build build && ctest --test-dir build`.
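+
+The `CMakeLists.txt` from step 1 could be sketched as follows. This is a minimal
+outline under stated assumptions rather than a copy of an existing example: the
+project name, target name, and `main.cu` are hypothetical, and exposing
+`CCCL_REPOSITORY` / `CCCL_TAG` as cache variables is one plausible way to accept
+the overrides that `examples/CMakeLists.txt` passes in.
+
+```cmake
+cmake_minimum_required(VERSION 3.18)       # assumed; matches the minimum CMake noted above
+project(MyCCCLExample LANGUAGES CXX CUDA)  # hypothetical project name
+
+include(cmake/CPM.cmake)
+
+# Overridable so CI can point CPM at the local checkout instead of GitHub.
+set(CCCL_REPOSITORY "https://github.com/NVIDIA/cccl" CACHE STRING "Repository to fetch CCCL from")
+set(CCCL_TAG "main" CACHE STRING "Tag, branch, or SHA to fetch")
+
+CPMAddPackage(
+  NAME CCCL
+  GIT_REPOSITORY "${CCCL_REPOSITORY}"
+  GIT_TAG "${CCCL_TAG}"
+)
+
+add_executable(my_example main.cu)  # hypothetical target and source file
+target_link_libraries(my_example PRIVATE CCCL::CCCL)
+
+# include(CTest) enables testing, so the outer `ctest --build-and-test` run finds this test.
+include(CTest)
+add_test(NAME my_example COMMAND my_example)
+```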
+
+## Additional resources
+
+- `references/docs.md` — index of `examples/` documentation (README files for each example).
diff --git a/.agent/skills/cccl_detail-examples/references/docs.md b/.agent/skills/cccl_detail-examples/references/docs.md
new file mode 100644
index 00000000000..82d8ce339bb
--- /dev/null
+++ b/.agent/skills/cccl_detail-examples/references/docs.md
@@ -0,0 +1,21 @@
+# Documentation index — cccl_detail-examples
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `examples/README.md` | CPM-based downstream examples verifying CCCL integration as an external dependency. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `examples/basic/README.md` | Hello-world examples using Thrust, CUB, and libcudacxx. |
+| `examples/cudax/README.md` | Experimental features demonstration examples. |
+| `examples/thrust_flexible_device_system/README.md` | Custom device system implementation example. |
+
+## See also
+
+- `cccl-cub` `references/docs.md` for CUB API documentation.
+- `cccl-thrust` `references/docs.md` for Thrust API documentation.
+- `cccl-cudax` `references/docs.md` for cudax API documentation.
diff --git a/.agent/skills/cccl_detail-github/SKILL.md b/.agent/skills/cccl_detail-github/SKILL.md
new file mode 100644
index 00000000000..9f59756bb48
--- /dev/null
+++ b/.agent/skills/cccl_detail-github/SKILL.md
@@ -0,0 +1,108 @@
+---
+description: |
+  CCCL GitHub-side infrastructure — issue and PR templates, CODEOWNERS review routing,
+  copy-pr-bot (internal mirror and CI trust), CodeRabbit AI review configuration,
+  non-CI automation workflows (backport, triage rotation, project sync, release tooling,
+  docs deploy, blackduck), and release changelog config.
+  Triggers: "how does CODEOWNERS work", "add someone to copy-pr-bot", "issue templates",
+  "coderabbit config", "how does backporting work", "PR template".
+---
+
+Reference skill — CCCL's GitHub-side repository infrastructure. Covers templates, bots, and automation workflows that are not part of the build/test CI pipeline. For CI matrix, skip tags, and `/ok to test`, see `cccl-ci`.
+
+## Issue templates
+
+Six structured templates under `.github/ISSUE_TEMPLATE/`:
+
+| File                  | Title prefix | Auto-label | Purpose                                           |
+|----------------------|---|---|---|
+| `1-bug_report.yml`    | `[BUG]:`     | —          | Bug type dropdown, component, reproducer, system info |
+| `2-feature_request.yml` | `[FEA]:`   | —          | Area dropdown, problem + solution + alternatives |
+| `3-doc_request.yml`   | `[DOC]:`     | —          | New vs correction, link to incorrect/missing docs |
+| `infra_ticket.yml`    | `[INFRA]:`   | `infra`    | CMake/GHA/CI/devcontainer requests; Slack for critical issues |
+| `devex_ticket.yml`    | `[DEVEX]:`   | `devex`    | Low-priority developer-experience improvements |
+| `config.yml`          | —            | —          | Enables blank issues; links to Discussions and Discord |
+
+All templates automatically add new issues to the `NVIDIA/6` GitHub project board. Load-bearing fields: **component/area dropdown** (routes triage), **duplicate confirmation checkbox** (required on bug/doc), **reproduction link** (optional but preferred on bugs).
+
+## PR template
+
+`.github/PULL_REQUEST_TEMPLATE.md` — three load-bearing elements:
+
+- `closes <!-- Link issue here -->` — PR title feeds CHANGELOG; issue link drives project board sync.
+- Tests checkbox — reviewers check this; CI does not enforce it.
+- Documentation checkbox — signal to reviewer, not gating.
+
+PR title is included verbatim in the release CHANGELOG (via `.github/release.yml` label-to-category mapping).
+
+## CODEOWNERS
+
+`.github/CODEOWNERS` routes required reviews by path:
+
+| Pattern                                             | Team                                   |
+|-----------------------------------------------------|----------------------------------------|
+| `thrust/`, `cub/`, `libcudacxx/`, `cudax/`, `c/`, `python/` | Per-library `cccl-*-codeowners` teams |
+| `.github/`, `ci/`, `.devcontainer/`, `.pre-commit-config.yaml`, `.clang-format`, `.clangd`, `c2h/`, `nvbench_helper/`, `.vscode`, `.coderabbit.yaml` | `cccl-infra-codeowners` |
+| `**/CMakeLists.txt`, `**/cmake/`                    | `cccl-cmake-codeowners` (cudax test CMakeLists overrides to cudax team) |
+| `benchmarks/`, `**/benchmarks`                      | `cccl-benchmark-codeowners` |
+| `README.md`, `docs/`, `examples/`                   | `cccl-codeowners` (general)            |
+
+All teams are under the `@nvidia/` org prefix.
+
+## copy-pr-bot
+
+`.github/copy-pr-bot.yaml` — GitHub App that mirrors every public PR to NVIDIA's internal repo so internal CI runners can safely execute it (public runners cannot access private infrastructure).
+
+Config knobs:
+- `enabled: true` — bot is active.
+- `auto_sync_draft: false` — draft PRs are not mirrored until marked ready.
+- `additional_trustees` — GitHub usernames whose PRs are automatically trusted for mirroring without a manual `/ok to test` approval. Currently `ahendriksen` and `gonzalobg`.
+
+The `/ok to test` trigger that fires CI on the mirrored PR is handled separately — see `cccl-ci`.
+
+## CodeRabbit
+
+`.coderabbit.yaml` at repo root (owned by `cccl-infra-codeowners`). AI review bot that posts inline comments on PRs targeting `main` or `branch/X.Y.x` release branches.
+
+Key configuration:
+- **Profile:** `chill` — avoids nitpicky style comments.
+- **Comment prefixes:** `suggestion:` / `important:` / `critical:` — three-level severity. No praise, no headings, no emoji.
+- **Drafts:** skipped (`auto_review.drafts: false`).
+- **Ignored bots:** `copy-pr-bot`, `dependabot[bot]`, `github-actions[bot]`, `nv-automation-bot`.
+- **Per-path instructions:** each major subdirectory (`libcudacxx/`, `cub/`, `thrust/`, `cudax/`, `c/`, `python/`, `benchmarks/`, `docs/`, `ci/`, `.github/`) has a path instruction block tuning review focus.
+- **Knowledge base:** ingests `AGENTS.md`, `CONTRIBUTING.md`, the libcudacxx style skill, and key RST docs.
+- **Finishing touches:** Autofix enabled; docstring/unit-test/simplify generation disabled.
+
+## Non-CI workflows
+
+Workflows under `.github/workflows/` that are not CCCL build/test pipelines:
+
+| Workflow                            | Trigger                | Role                                              |
+|-------------------------------------|----------------------|--------------------------------------------------|
+| `backport-prs.yml`                  | PR merged or `/backport` comment | Opens backport PR to release branches |
+| `triage_rotation.yml`               | Issue opened (non-member) | Auto-assigns external issues for triage |
+| `project_automation_sync_pr_issues.yml` | PR opened/edited     | Syncs linked issues to project board |
+| `project_automation_set_in_progress.yml` | PR opened/edited    | Moves PR + issues to "In Progress" |
+| `project_automation_set_in_review.yml` | PR event            | Moves to "In Review" state             |
+| `project_automation_set_roadmap.yml` | Issue/PR event       | Sets roadmap column on project board   |
+| `docs-deploy.yml`                   | Push to `main` or manual | Builds and deploys Sphinx docs to GitHub Pages |
+| `update-branch-version.yml`         | Manual               | Bumps version numbers in a release branch |
+| `release-create-new.yml`            | Manual               | Creates new release branch + initial PR |
+| `release-update-rc.yml`             | Manual               | Updates release candidate state        |
+| `release-finalize.yml`              | Manual               | Finalizes and tags a release           |
+| `release-wheels.yml`                | Manual               | Publishes Python wheels to PyPI        |
+| `blackduck-sca.yml`                 | Push to `main` or manual | Black Duck software composition analysis (security/license) |
+| `build-rapids.yml`                  | Manual/scheduled      | Downstream compatibility — builds RAPIDS against CCCL |
+| `build-pytorch.yml`                 | Manual/scheduled      | Downstream compatibility — builds PyTorch against CCCL |
+| `build-matx.yml`                    | Manual/scheduled      | Downstream compatibility — builds MatX against CCCL |
+| `verify-devcontainers.yml`          | PR/push              | Verifies devcontainer image builds     |
+
+Release workflows have a sequenced README at `.github/workflows/release-README.md`.
+
+## Release changelog
+
+`.github/release.yml` — GitHub's auto-generated release notes config. Maps PR labels to CHANGELOG sections: Thrust/CUB, libcudacxx, cuda.coop, cuda.compute, Documentation, Other Changes. PR title is the changelog entry — keep titles descriptive.
+
+## Additional resources
+
+- `references/docs.md` — index of GitHub-side infrastructure documentation (branching, backport, release workflows).
diff --git a/.agent/skills/cccl_detail-github/references/docs.md b/.agent/skills/cccl_detail-github/references/docs.md
new file mode 100644
index 00000000000..04d1859e584
--- /dev/null
+++ b/.agent/skills/cccl_detail-github/references/docs.md
@@ -0,0 +1,24 @@
+# Documentation index — cccl_detail-github
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `docs/maintainers/branching_strategy.rst` | Main, release (`branch/X.Y.x`), and development branch conventions; promotion path. |
+| `docs/maintainers/backport_process.rst` | Cherry-picking bugfixes to release branches; `/backport` comment trigger for automation. |
+| `.github/workflows/release-README.md` | Release workflow numbering, RC vs. final promotion path, manual workflow sequence. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/maintainers/index.rst` | Maintainer policies and procedures landing page. |
+| `docs/maintainers/how_tos/index.rst` | Practical maintainer guides (release, backport, etc.). |
+| `docs/maintainers/references/index.rst` | Maintainer reference material index. |
+| `docs/maintainers/coderabbit.rst` | CodeRabbit AI review configuration overview. |
+| `.github/PULL_REQUEST_TEMPLATE.md` | PR template: issue link, tests checkbox, documentation checkbox. |
+
+## See also
+
+- `cccl_detail-ci` `references/docs.md` for CI-pipeline documentation.
+- `cccl-infra` `references/` for release-cut and CTK-bump runbooks.
diff --git a/.agent/skills/cccl_detail-release/SKILL.md b/.agent/skills/cccl_detail-release/SKILL.md
new file mode 100644
index 00000000000..b888200df9c
--- /dev/null
+++ b/.agent/skills/cccl_detail-release/SKILL.md
@@ -0,0 +1,88 @@
+---
+description: |
+  CCCL versioning and release pipeline — `cccl-version.json`, `ci/update_version.sh`,
+  release workflows (numbered 0–3), Python wheel versioning via `setuptools_scm`,
+  header version macros, branch/tag conventions, and the RC → final promotion path.
+  Triggers: "how does CCCL versioning work", "how do I cut a release", "what is cccl-version.json",
+  "release branch convention", "how are wheels versioned".
+---
+
+Single source of truth for CCCL version numbers is `cccl-version.json` at the repo root.
+The `ci/update_version.sh` script propagates a new version to every consumer in one pass.
+Releases follow four numbered workflows (steps 0–3), each triggered manually via `workflow_dispatch`.
+
+## Version pipeline
+
+`cccl-version.json` holds `{ "full", "major", "minor", "patch" }`.
+`ci/update_version.sh <major> <minor> <patch>` writes every downstream target atomically.
+`--dry-run` shows a diff without touching the tree.
+
+Targets updated by `update_version.sh`:
+
+| Target                                      | Format                                        |
+|---------------------------------------------|-----------------------------------------------|
+| `cccl-version.json`                         | plain `x.y.z`                                 |
+| `libcudacxx/include/cuda/std/__cccl/version.h` | `CCCL_VERSION` → `MMMmmmppp` (9 digits)   |
+| `thrust/thrust/version.h`                   | `THRUST_VERSION` → `MMMmmmpp` (8 digits)      |
+| `cub/cub/version.cuh`                       | `CUB_VERSION` → same 8-digit scheme            |
+| `lib/cmake/*/*-config-version.cmake` (×5)  | separate `MAJOR`/`MINOR`/`PATCH` vars         |
+| `python/cuda_cccl/cuda/cccl/_version.py`   | `__version__ = "x.y.z"`                       |
+| `docs/VERSION.md`                           | `x.y` (major.minor only)                      |
+
+The script guards against downgrades: it compares the encoded new version against both
+the current `CCCL_VERSION` define and the latest `git tag`.
+
+## Python wheel versioning
+
+`python/cuda_cccl/pyproject.toml` declares `dynamic = ["version"]` and sources it via
+`scikit_build_core.metadata.setuptools_scm` with `root = "../.."`.
+At build time `setuptools_scm` derives the version from the nearest git tag.
+Release builds tag first (`vX.Y.Z`), then build wheels — the tag drives the wheel version.
+`python/cuda_cccl/cuda/cccl/_version.py` carries a static fallback (`__version__`) updated
+by `update_version.sh` for editable / non-tag installs.
+
+## Branch and tag conventions
+
+| Artifact                | Pattern                          | Example                        |
+|-------------------------|----------------------------------|--------------------------------|
+| Release branch          | `branch/{major}.{minor}.x`       | `branch/3.4.x`                 |
+| Release candidate tag   | `v{full}-rc{N}`                  | `v3.4.0-rc0`                   |
+| Final release tag       | `v{full}`                        | `v3.4.0`                       |
+| Version-bump PR branch  | `pr/ver/{branch}-v{version}`     | `pr/ver/branch/3.4.x-v3.4.1`   |
+
+## Release workflows (numbered steps)
+
+All four workflows are `workflow_dispatch` only — no automatic triggers.
+
+**Step 0 — Update version in target branch** (`update-branch-version.yml`)
+Creates a `pr/ver/…` branch, runs `update_version.sh`, opens a PR into the target branch.
+Shared composite action: `.github/actions/version-update/action.yml`.
+
+**Step 1 — Begin Release Cycle** (`release-create-new.yml`)
+Run on `main` (or the commit to branch from). Creates `branch/{major}.{minor}.x` from the
+current SHA, then calls the version-update action to bump `main` to the next version.
+
+**Step 2 — Test and Tag New RC** (`release-update-rc.yml`)
+Run on `branch/{major}.{minor}.x`. Reads version from `cccl-version.json`, auto-increments
+the RC counter (scanning existing `vX.Y.Z-rcN` tags), pushes `v{full}-rc{N}`.
+
+**Step 3 — Create Final Release** (`release-finalize.yml`)
+Run on an RC tag (`vX.Y.Z-rcN`). Validates `cccl-version.json` matches the tag, generates
+source + install archives via `ci/install_cccl.sh`, pushes `vX.Y.Z` tag from the RC tag,
+creates a draft GitHub release. Optional `create_patch_version` input opens a patch-bump PR.
+
+**Wheel publish** (`release-wheels.yml`)
+Separate manual workflow. Takes a prior GHA `run-id` (the validated build artifacts) and a
+destination (`pypi` or `testpypi`). Publishes via `pypa/gh-action-pypi-publish`.
+
+## Hard prohibitions
+
+- Never run `update_version.sh` with a version lower than the current `CCCL_VERSION` or the
+  latest git tag — the script rejects it, but don't try to work around the guard.
+- Step 3 must be started on an RC tag, not a branch ref.
+- Wheel publish must reference a previously validated build run — don't build and publish
+  in the same step.
+
+## Additional resources
+
+- `references/docs.md` — index of release workflow and versioning documentation.
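The version encodings and naming patterns described in `cccl_detail-release/SKILL.md` above can be sketched in Python. This is an illustration only, not the logic of `ci/update_version.sh` or the release workflows: the function names are hypothetical, the digit layout is read off the `MMMmmmppp` / `MMMmmmpp` notation in the target table, and starting the RC counter at 0 and taking "highest existing + 1" is an assumption based on the `v3.4.0-rc0` example.

```python
import re


def encode_cccl(major: int, minor: int, patch: int) -> int:
    """MMMmmmppp: three decimal digits each for major, minor, and patch."""
    return major * 1_000_000 + minor * 1_000 + patch


def encode_thrust_cub(major: int, minor: int, patch: int) -> int:
    """MMMmmmpp: three digits for major and minor, two for patch."""
    return major * 100_000 + minor * 100 + patch


def release_branch(major: int, minor: int) -> str:
    """Release branch name following the branch/{major}.{minor}.x pattern."""
    return f"branch/{major}.{minor}.x"


def next_rc_tag(full: str, existing_tags: list[str]) -> str:
    """Next v{full}-rc{N} tag, assuming N is one past the highest existing RC."""
    pattern = re.compile(rf"^v{re.escape(full)}-rc(\d+)$")
    used = [int(m.group(1)) for t in existing_tags if (m := pattern.match(t))]
    return f"v{full}-rc{max(used, default=-1) + 1}"


if __name__ == "__main__":
    print(encode_cccl(3, 4, 0), encode_thrust_cub(3, 4, 0))
    print(release_branch(3, 4), next_rc_tag("3.4.0", ["v3.4.0-rc0"]))
```

Comparing two encoded integers is enough to order versions, which is presumably why the downgrade guard works on the encoded form; the exact comparison the script performs is not shown here.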
diff --git a/.agent/skills/cccl_detail-release/references/docs.md b/.agent/skills/cccl_detail-release/references/docs.md
new file mode 100644
index 00000000000..c87e471671e
--- /dev/null
+++ b/.agent/skills/cccl_detail-release/references/docs.md
@@ -0,0 +1,18 @@
+# Documentation index — cccl_detail-release
+
+## Primary
+
+| Path | What it covers |
+|------|----------------|
+| `.github/workflows/release-README.md` | Release workflow numbering (steps 0–3), RC vs. final promotion path, manual workflow sequence and prerequisites. |
+
+## Adjacent
+
+| Path | What it covers |
+|------|----------------|
+| `docs/VERSION.md` | `cccl-version.json` format, `setuptools_scm` integration for Python wheel versioning. |
+
+## See also
+
+- `cccl-infra` `references/docs.md` for maintainer docs covering the broader release and maintenance context.
+- `cccl_detail-github` `references/docs.md` for branching strategy and GitHub release notes config.
diff --git a/.agent/skills/cccl_detail-test-params/SKILL.md b/.agent/skills/cccl_detail-test-params/SKILL.md
new file mode 100644
index 00000000000..0dfda38f945
--- /dev/null
+++ b/.agent/skills/cccl_detail-test-params/SKILL.md
@@ -0,0 +1,84 @@
+---
+description: |
+  CCCL %PARAM% test parameterization — `cmake/CCCLTestParams.cmake`, comment syntax,
+  cartesian-product axis expansion, generated CTest names, and `VAR_IDX` preprocessor
+  injection. Currently used in CUB test trees; not yet adopted by Thrust or cudax.
+  See `cccl-test` to run the generated test variants.
+  Triggers: "%PARAM% test", "CCCLTestParams.cmake", "test variant expansion",
+  "VAR_IDX", "cccl_parse_variant_params".
+---
+
+Parameterize a single `.cu`/`.cpp` test source into multiple CTest executables via
+`%PARAM%` source comments. `cmake/CCCLTestParams.cmake` parses the comments at configure
+time and expands them into the cartesian product of all axis values.
+
+## Syntax
+
+Each `%PARAM%` line follows this form:
+
+```
+// %PARAM% <DEFINITION> <label> <val0>:<val1>:...
+```
+
+- **DEFINITION** — preprocessor macro injected into each variant build. Convention: `TEST_` prefix.
+- **label** — short token included in the generated executable name.
+- **values** — colon-separated list. Only numeric values are tested in practice.
+
+Example with two axes:
+
+```cpp
+// %PARAM% TEST_FOO foo 0:1:2
+// %PARAM% TEST_LAUNCH lid 0:1
+```
+
+## Expansion
+
+`cccl_parse_variant_params` builds the cartesian product. For the example above, six
+variants are generated:
+
+| Executable name      | Compile definitions                      |
+|----------------------|------------------------------------------|
+| `<base>.foo_0.lid_0` | `TEST_FOO=0 TEST_LAUNCH=0 VAR_IDX=0`     |
+| `<base>.foo_0.lid_1` | `TEST_FOO=0 TEST_LAUNCH=1 VAR_IDX=1`     |
+| `<base>.foo_1.lid_0` | `TEST_FOO=1 TEST_LAUNCH=0 VAR_IDX=2`     |
+| `<base>.foo_1.lid_1` | `TEST_FOO=1 TEST_LAUNCH=1 VAR_IDX=3`     |
+| `<base>.foo_2.lid_0` | `TEST_FOO=2 TEST_LAUNCH=0 VAR_IDX=4`     |
+| `<base>.foo_2.lid_1` | `TEST_FOO=2 TEST_LAUNCH=1 VAR_IDX=5`     |
+
+`VAR_IDX` is always injected alongside the declared definitions; it identifies the
+variant's position in the cartesian product.
+
+## Name convention
+
+Variant suffix: `.label_value` per axis, dot-separated, appended to the base test name.
+Base name is derived from the source filename by the consuming `CMakeLists.txt`.
+
+In CUB, a `lid` axis (`lid 0:1:2`) also controls the launch mode:
+- `lid_0` — host launch, RDC off
+- `lid_1` — device launch (CDP), RDC on
+- `lid_2` — graph capture, RDC off
+
+## CMake API
+
+Three public functions in `cmake/CCCLTestParams.cmake`:
+
+| Function                                       | Purpose                                  |
+|------------------------------------------------|------------------------------------------|
+| `cccl_parse_variant_params(src num_var labels defs)` | Parse source; populate label and definition lists |
+| `cccl_get_variant_data(labels defs idx label_var defs_var)` | Extract label + defs for one variant index |
+| `cccl_log_variant_params(base num labels defs)` | Emit detected variant info at `VERBOSE` log level |
+
+## Reconfiguration note
+
+CMake does not track source file changes for reconfiguration. After modifying `%PARAM%`
+comments, rerun CMake manually.
+
+## Parameter axis guidance
+
+Split only parameters that change template instantiations (typically input value types).
+Splitting integral parameters such as `BLOCK_THREADS` compiles redundant code into
+separate executables and inflates build time.
+
+## Cross-reference
+
+`cccl-test` — how to build and run the generated CTest variants.
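The expansion table in `cccl_detail-test-params/SKILL.md` above can be reproduced with a short Python sketch. The real parser is the CMake code in `cmake/CCCLTestParams.cmake`; the regex, the function names, and the assumption that the last declared axis varies fastest (which matches the `VAR_IDX` column in the table) are illustrative, not taken from that file.

```python
import re
from itertools import product

# Assumed shape of a parameter line: // %PARAM% <DEFINITION> <label> <v0>:<v1>:...
PARAM_RE = re.compile(r"^\s*//\s*%PARAM%\s+(\S+)\s+(\S+)\s+(\S+)\s*$")


def parse_params(source: str):
    """Collect (definition, label, values) for each %PARAM% comment line."""
    axes = []
    for line in source.splitlines():
        m = PARAM_RE.match(line)
        if m:
            definition, label, values = m.groups()
            axes.append((definition, label, values.split(":")))
    return axes


def expand_variants(base: str, axes):
    """Yield (executable name, compile definitions) for every axis combination."""
    value_lists = [values for _, _, values in axes]
    # itertools.product varies the last axis fastest.
    for idx, combo in enumerate(product(*value_lists)):
        suffix = "".join(f".{label}_{v}" for (_, label, _), v in zip(axes, combo))
        defs = [f"{d}={v}" for (d, _, _), v in zip(axes, combo)]
        defs.append(f"VAR_IDX={idx}")
        yield base + suffix, defs


if __name__ == "__main__":
    src = "// %PARAM% TEST_FOO foo 0:1:2\n// %PARAM% TEST_LAUNCH lid 0:1\n"
    for name, defs in expand_variants("<base>", parse_params(src)):
        print(name, " ".join(defs))
```

Adding a third axis with three values would triple the variant count to 18, which is the build-time cost the parameter axis guidance warns about.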
diff --git a/AGENTS.md b/AGENTS.md
index f8ba92d65b3..e3779d458ee 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -2,9 +2,11 @@
 
 ## Load the `cccl` skill first
 
-Load the `cccl` skill via the Skill tool. It maps the available repo-local skills and agents and routes by user intent.
+Load the `cccl` skill via the Skill tool. It is the entry-point router for the `cccl-*` skill and agent
+family, and carries the full intent → skill routing table. Every session begins here.
 
-If you don't know what skills or agents are, load `cccl-agent-impl` first.
+The `cccl` skill points at `references/skills-and-agents.md` for the complete catalog and invocation
+mechanics.
 
 ## What CCCL is
 
@@ -44,34 +46,19 @@ cccl/
 └── CMakePresets.json
 ```
 
-`.agent/` is canonical; `.claude/skills` and `.claude/agents` symlink to it so both Claude Code and Codex find the
-same files.
+`.agent/` is canonical; `.claude/skills` and `.claude/agents` symlink to it so both Claude Code and Codex find
+the same files. Reference docs live in `CONTRIBUTING.md`, `ci-overview.md`, and `docs/cccl/development/`.
 
-## Skill routing
+## Skill and agent naming
 
-The `cccl` skill carries the full table. Common entries:
+- Entry skills use the `cccl-` prefix and appear in `/cccl-` slash autocomplete.
+- Detail skills use the `cccl_detail-` prefix (underscore between `cccl` and `detail`), auto-load via
+  description match, and are excluded from slash autocomplete.
 
-- Commit uncommitted changes / wrap up a fix → `cccl-commit`
-- Resplit / clean up a branch's commit history → `cccl-resplit-branch`
-- Open / edit / comment on a PR / trigger CI → `cccl-pr`
-- CI overview / matrix / skip tags / `/ok to test` → `cccl-ci`
-- Triage failed CI → `cccl-triage-pr` or `cccl-triage-nightly`
-- Benchmarks → `cccl-ci-benchmarks`
-- Git bisect → `cccl-bisect`
-- Devcontainers → `cccl-devcontainers`
-- Targeted build/test (fast iteration) → `cccl-build-and-test-targets`
-- Full-matrix C++ build/test scripts → `cccl-cpp-builds`
-- Python packages (cuda-cccl) → `cccl-python`
-- libcudacxx code style → `cccl-libcudacxx-style`
-- SASS / PTX comparison → `cccl-sass-diff`
-- Stuck on a decision → `cccl-clarify`
+## Known limitations
 
-Reference docs: `CONTRIBUTING.md`, `ci-overview.md`, `docs/cccl/development/`.
-
-## Known agent limitations
-
-- Long-running builds (60+ min) and tests (30+ min) are normal — never cancel them. Use
-  `cccl-build-and-test-targets` for fast iteration.
+Long-running builds (60+ min) and tests (30+ min) are normal — never cancel them. The `cccl-build` and
+`cccl-test` skills cover fast-iteration targeted builds and tests.
 
 ## Pre-commit
 

From 558437e4e57eea7d3a286a4891c0e0014af44d6e Mon Sep 17 00:00:00 2001
From: Allison Piper <alliepiper16@gmail.com>
Date: Thu, 14 May 2026 13:02:17 -0400
Subject: [PATCH 7/7] [skills] Add cccl-README.md framework overview

Top-level overview of the cccl-* skill and agent framework: purpose,
end-to-end prompt examples, approval gates, and detailed example
prompts per workflow area. Sits at .agent/skills/cccl-README.md as a
sibling to the cccl/ entry skill.

[skip-matrix][skip-vdc][skip-docs][skip-tpt]
---
 .agent/skills/cccl-README.md | 324 +++++++++++++++++++++++++++++++++++
 1 file changed, 324 insertions(+)
 create mode 100644 .agent/skills/cccl-README.md

diff --git a/.agent/skills/cccl-README.md b/.agent/skills/cccl-README.md
new file mode 100644
index 00000000000..fc9da2077fb
--- /dev/null
+++ b/.agent/skills/cccl-README.md
@@ -0,0 +1,324 @@
+# CCCL skill & agent framework
+
+## Overview
+
+The `cccl-*` skills and agents wrap CCCL's build, test, CI, benchmarking, commit/PR, and release
+infrastructure into named entry points navigated by intent. Top-level skills (`cccl-build`,
+`cccl-triage`, `cccl-commit`, `cccl-bench`, `cccl-infra`, …) drive user-facing workflows;
+`cccl_detail-*` skills hold shared reference material; read-only agents handle mechanical work like
+fetching failed jobs or summarizing logs. Each repeated workflow is encoded once, so every task
+starts from a known entry point with relevant project-specific details in context.
+
+---
+
+> **Approval gates remain.** Skills handle the research, drafting, splitting, and
+> message composition. Every `git add` / `commit` / `push`, every `gh pr` write
+> action, and every `/ok to test` still waits for explicit user approval.
+
+---
+
+## End-to-end prompt examples
+
+> "PR #8965 is failing in CI on the libcudacxx jobs for cuda13.2/gcc14 — figure
+> out why, fix it, commit with override tags so we don't re-run the green half of
+> the matrix, push, mark ready"
+
+`cccl-triage` (fetch + cluster + summarize) → engineer fix → `cccl-ci-overrides`
+(generate the override) → `cccl-commit` (test gate + commit message) →
+`cccl-pr` (push + ready + retrigger CI). End-to-end automation of the most
+expensive recurring workflow in this repo.
+
+> "device_radix_sort was 1.4x faster on tag 3.0. Bisect, validate the regression
+> isn't a SASS-level codegen surprise, fix it, commit, PR, request a bench run."
+
+`cccl-bisect` → `cccl-sass-diff` (validate it's a real algorithmic regression,
+not codegen drift) → engineer fix → `cccl-bench` (verify locally) →
+`cccl-commit` → `cccl-pr` → `cccl-bench` (CI bench request with `[bench-only]`).
+
+> "Resplit this branch — it has 14 messy WIP commits, I want 3 clean ones split by
+> library, rebased on current main"
+
+`cccl-resplit-branch` → `cccl-commit`. Backs up the tip to `refs/backup/<branch>-<ts>`,
+rebases (escalating conflicts via `cccl-clarify`), collapses the commits into the working
+tree via `git reset --mixed main`, then hands off to `cccl-commit` with the original commit
+subjects as starters.
+
+> "I'm onboarding a contributor today. They want to land a small CUB algorithm
+> change. Hand them the doc."
+
+`cccl` (entry router) → walks them through: `cccl-devcontainer` → `cccl-cub`
+(orientation) → `cccl-build` + `cccl-test` → `cccl-commit` → `cccl-pr`.
+
+---
+
+## 1. Daily inner loop — build, test, iterate
+
+> "Build cub for sm90, then run the device_radix_sort tests"
+
+`cccl-build` → `cccl-test`. Picks the right preset, runs the targeted build, filters CTest by
+regex to the requested suite, and reports pass/fail. Fast iteration path: single preset, no matrix.
+
+> "I just touched `cub/cub/device/dispatch/dispatch_reduce.cuh`. Build cub fast and run
+> only the device_reduce tests."
+
+`cccl-build` → `cccl-test`. Targeted incremental build via `build_and_test_targets.sh`;
+filters CTest by regex.
+
+> "Run the libcudacxx lit tests for `cuda/std/__type_traits/scalar_type.h` under sm90"
+
+`cccl-test`. Picks libcudacxx preset, points lit at the right test directory.
+
+> "Open a shell in a devcontainer with CUDA 13.2 and gcc 14"
+
+`cccl-devcontainer`. Wraps `.devcontainer/launch.sh --cuda 13.2 --host gcc14`.
+Detects whether you're already inside a container.
+
+> "Build cudax with the cu13 nightly toolkit in a headless container, then run all
+> cudax tests"
+
+`cccl-devcontainer` → `cccl-build` → `cccl-test`. `-d` headless launch with
+`-- ./ci/build_cudax.sh` then `./ci/test_cudax.sh`.
+
+> "What CMake presets are available and which one builds everything for native arch?"
+
+`cccl-cmake`. Tabulates presets; recommends `all-dev`.
+
+---
+
+## 2. CI firefighting
+
+> "Triage PR #8963"
+
+`cccl-triage`. Resolves the PR's latest CI run, dispatches `cccl-ci-fetch-failures`
+to list failures, clusters by toolchain/library/variant, dispatches
+`cccl-ci-summarize-job-log` in parallel (haiku) on representatives, returns a compact
+failure-cluster table and asks which clusters to dig into.
+
+> "What's failing on the nightly?"
+
+`cccl-triage` (nightly mode). Same flow, run-id resolved from `nightly.yml`. Especially
+useful for the matrix-sized failure sets where you need clustering, not 200 raw logs.
+
+> "Just give me the failed jobs for the current branch -- I want to grep the list myself"
+
+`cccl-ci-fetch-failures` direct. Returns TSV: `<job-id>\t<full-name>\t<grouping-hint>`.
+
+> "Summarize this CI job log: https://github.com/NVIDIA/cccl/actions/runs/.../job/..."
+
+`cccl-ci-summarize-job-log`. Fetches the log, returns failing step, exact command line,
+5–20 lines of raw error, and a code/infra/flaky verdict.
+
+> "Generate a `workflows.override` so this PR only re-runs the cub and libcudacxx jobs
+> on gcc 14"
+
+`cccl-ci-overrides`. Reads `ci/matrix.yaml` schema, emits the minimum override matrix
+snippet plus recommended skip tags, with rationale.
+
+> "Why did the cuda12.6/clang14 job run for this PR? I didn't touch anything that
+> needs clang."
+
+`cccl-ci` + `cccl-ci-overrides`. Explains matrix expansion via
+`ci/inspect_changes.py` and `ci/project_files_and_dependencies.yaml`, and identifies
+the trigger path.
+
+> "Walk me through how PR CI is structured — what's the difference between the
+> `pull_request` and `nightly` workflows?"
+
+`cccl-ci`. Reference skill — flow diagram, sources of truth, skip-tag mechanics.
+
+---
+
+## 3. Regression hunting
+
+> "device_scan was 1.2x faster a week ago. Find the commit that regressed it."
+
+`cccl-bisect` (cloud route). Dispatches the `git-bisect.yml` workflow with the right
+runner label, build/test targets, and good/bad refs. Returns the bad commit hash and
+the command line that reproduces the failure locally.
+
+> "Bisect this segfault on the cuda13.2/gcc14 config — it definitely worked on the
+> 3.0 release."
+
+`cccl-bisect`. Resolves `3.0` to a tag, runs cloud bisect, returns the bad commit
+with a reproducer command.
+
+> "Bisect locally in a devcontainer — I don't want to wait for the cloud queue"
+
+`cccl-bisect` (local route). Wraps `ci/util/git_bisect.sh` inside
+`.devcontainer/launch.sh`.
+
+> "Did my recent CUB tuning change affect codegen for `DeviceRadixSort`?"
+
+`cccl-sass-diff`. Builds both refs, dumps SASS via `cuobjdump`, normalizes addresses
+and register renames, reports the top 5 non-trivial diffs by kernel.
+
+---
+
+## 4. Commit / PR endgame
+
+> "Commit these changes"
+
+`cccl-commit`. Component selection → optional split → interactive chunk walkthrough
+→ optional test gate → commit message draft (Trivial/Standard/Detailed) → `git commit -F`.
+Refuses to commit on `main`.
+
+> "Wrap this up — I want three separate commits split by library (cub, thrust,
+> libcudacxx). Run the precommit gate first."
+
+`cccl-commit`. Plans three commit groups, walks chunks, runs pre-commit, drafts per-group
+messages, executes each commit.
+
+> "Push and open a draft PR titled `[Tile] Reenable seed_seq tests`"
+
+`cccl-pr` (open new draft). Sanity-check, detect push remote, push branch, open draft PR
+with the title and body.
+
+> "Update the PR body to mention the SASS-diff results"
+
+`cccl-pr` (edit existing). `gh pr edit --body-file -`.
+
+> "Mark PR #9001 ready for review"
+
+`cccl-pr` (draft→ready transition).
+
+> "Trigger CI on this PR"
+
+`cccl-pr` (push + trigger). SHA verification gate, then `/ok to test <SHA>` comment.
+Never posts without verification.
+
+---
+
+## 5. Library development
+
+> "Add a CUB device-scope algorithm `cub::DeviceMode` that returns the most-frequent
+> value. Tour me through the directory layout and tuning policy conventions."
+
+`cccl-cub` (orientation) → manual implementation → `cccl-build` + `cccl-test` to
+verify. Covers block/warp/device/agent scopes, the tuning-policy selector pattern,
+and Catch2 vs legacy test layout.
+
+> "Make this cudax change libcudacxx-style compliant"
+
+`cccl-libcudacxx` (style references — `headers.md`, `macros.md`, `naming.md`,
+`templates.md`, `testing.md`, `visibility.md`). Style enforcement applies to
+`libcudacxx/include/` AND `cudax/include/`.
+
+> "Where do I add a new Thrust algorithm with CUDA + cpp + omp + tbb backends?"
+
+`cccl-thrust`. Explains the per-backend directory layout (`thrust/system/{cuda,cpp,omp,tbb}/`),
+the ADL dispatch via execution policies, and the typical pattern of `thrust::sort` →
+`cub::DeviceRadixSort` for the CUDA backend.
+
+> "What's the C ABI pattern for adding a new algorithm to the C Parallel Library?"
+
+`cccl-c`. Three-call pattern (`_build`, `_run`, `_cleanup`), stable C ABI layer,
+JIT-backed cubins via NVRTC, custom iterator/operator types via template strings.
+
+> "What's in cudax that's stable enough to graduate to libcudacxx?"
+
+`cccl-cudax` + `cccl-libcudacxx`. Covers the zero-stability contract and
+`CCCL_ENABLE_UNSTABLE` flag on the cudax side; the upstream-tracking model and
+where CCCL extensions live on the libcudacxx side.
+
+> "Test `cuda.compute` against the cu13 install"
+
+`cccl-python`. `pip install -e python/cuda_cccl[test-cu13]` then
+`ci/test_cuda_compute_python.sh`.
+
+> "I added a new Numba CUDA cooperative primitive under `cuda.coop._experimental`.
+> How do I wire up the tests?"
+
+`cccl-python`. Explains the `cuda_coop` test pattern, points at
+`ci/test_cuda_coop_python.sh`.
+
+---
+
+## 6. Performance
+
+> "Write a CUB benchmark for the new `DeviceThreeWayPartition` algorithm using
+> nvbench, with `%RANGE%` tuning annotations for items-per-thread"
+
+`cccl-bench` (nvbench-template reference). Generates per-variant `.cu` files with
+the shared `base.cuh` pattern.
+
+> "Request a CI bench run for this PR — focus on device_reduce and device_scan,
+> sm90 + sm120 GPUs only"
+
+`cccl-bench` (ci-bench-request reference). Edits `ci/bench.yaml` with the filters and
+appends `[bench-only]` to the commit message. `ci/bench.yaml` must be reset to the template before merge.
+
+> "Compare perf of this branch vs main for `thrust::sort` on 1M..256M element keys"
+
+`cccl-bench` (local-run reference). Wraps `ci/bench/compare_git_refs.sh`.
+
+> "Sweep CUB's `BlockScan` tuning space for sm120 and pick a new policy"
+
+`cccl-bench` (tuning reference). Wraps the `cccl.bench` harness with
+`CUB_ENABLE_TUNING=ON`, generates `.variant` targets, sweeps, picks the optimum.
+
+> "Write a Python benchmark using `cuda.bench` for the new `cuda.compute.sort_pairs`
+> binding"
+
+`cccl-bench` + `cccl-python`. Python path uses `cuda.bench` with axis registration
+and `bench.run_all_benchmarks(sys.argv)`.
+
+---
+
+## 7. Infrastructure & release
+
+> "Bump the supported CUDA toolkit to 13.3"
+
+`cccl-infra` (ctk-bump playbook). Edits `ci/matrix.yaml` (`ctk_versions`,
+`devcontainer_version`, workflow rows), regenerates `.devcontainer/` via the
+matrix-aware generator, verifies the workflow expansion. Refuses to hand-edit
+individual `devcontainer.json` files.
+
+> "Add support for gcc 15 to the host compiler matrix"
+
+`cccl-infra` (compiler-bump playbook). Adds the compiler to `host_compilers`, the CUDA-specific
+version table, and the workflow rows, then regenerates devcontainers.
+
+> "Cut a 3.2.0 release"
+
+`cccl-infra` (release-cut playbook). Drives `ci/update_version.sh` to update the per-library
+version files (cub, thrust, libcudacxx, cudax), `cccl-version.json`, `docs/VERSION.md`, and the
+Python package, then walks through the release workflows. Never hand-edits version files.
+
+> "Add a new project under `c/parallel/` called `cccl-async` and wire it into CI"
+
+`cccl-infra` (project-add playbook). `ci/matrix.yaml` workflow rows + `jobs:`,
+`ci/project_files_and_dependencies.yaml` new key + deps, `CMakePresets.json`,
+build/test scripts. Touches every infra file the project needs.
+
+> "Pre-commit is failing — fix the formatting"
+
+`cccl-precommit`. Runs the suite, reviews diffs, stages fixed files, re-runs.
+Knows the auto-fix subset (clang-format, ruff, gersemi, end-of-file) vs the
+non-auto-fix subset (codespell, mypy, shellcheck).
+
+> "Build the docs locally"
+
+`cccl-docs`. Runs `./docs/gen_docs.bash` (Linux-only; builds Doxygen 1.9.6 on the first
+run, creates a venv, runs Sphinx).
+
+> "My new header isn't showing up in the API docs"
+
+`cccl-docs` (doxygen-breathe-gotchas reference). Per-library Doxyfile inclusion
+patterns, Breathe bridge config, custom `_ext/auto_api_generator.py`.
+
+---
+
+## 8. Decision-point prompts
+
+> "I'm stuck — should I cherry-pick this fix onto `branch/3.1.x` or wait for the
+> next 3.2 release?"
+
+`cccl-clarify`. Three-step ladder: default reasoning from project conventions →
+check the release cadence and the bug severity → ask the user with framed
+options (cherry-pick / wait / hotfix release / break this down).
+
+> "I have a clang-format diff but also a real code change in the same hunk —
+> separate them?"
+
+`cccl-commit` + `cccl-clarify`. Surfaces the choice as part of the interactive
+chunk walkthrough.
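For the "grep the list myself" prompt in section 2 of `cccl-README.md`, the TSV returned by `cccl-ci-fetch-failures` (`<job-id>\t<full-name>\t<grouping-hint>`) can be clustered with a few lines of Python. This is a minimal sketch that assumes exactly three tab-separated columns per line and groups on the third column verbatim; the function name and the sample rows are hypothetical, and the clustering that `cccl-triage` actually performs may differ.

```python
from collections import defaultdict


def cluster_failures(tsv_text: str) -> dict:
    """Group (job_id, full_name) pairs by the grouping-hint column."""
    clusters = defaultdict(list)
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        job_id, full_name, hint = fields
        clusters[hint].append((job_id, full_name))
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical rows, not real CI output.
    sample = (
        "101\tjob A\tgroup-1\n"
        "102\tjob B\tgroup-2\n"
        "103\tjob C\tgroup-1\n"
    )
    for hint, jobs in cluster_failures(sample).items():
        print(f"{hint}: {len(jobs)} failed job(s)")
```

Sorting the clusters by size before printing would surface the largest failure group first, which is usually the one worth summarizing.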