scaleapi · varunursekar · Jun 16, 2026
diff --git a/vero/.gitignore b/vero/.gitignore
@@ -11,6 +11,8 @@ __pycache__/
 *.egg-info/
 dist/
 build/
+# ...but the harbor compiler package is source, not a packaging artifact:
+!src/vero/harbor/build/
 
 # Testing
 .pytest_cache/

diff --git a/vero/README.md b/vero/README.md
@@ -525,6 +525,22 @@ agent = VeroAgent(
 )
 ```
 
+## Harbor integration
+
+vero can compile an optimization run into a [Harbor](https://www.harborframework.com) task, so the *optimizer* itself becomes a Harbor agent-under-test: any Harbor agent (Claude Code, an oracle script, …) edits a target repo and spends an evaluation budget, and the reward is the best candidate's score on a hidden split. This makes optimization runs reproducible and leaderboard-gradeable — the optimizer can't read hidden labels, modify the scorer, or bypass its budget.
+
+```bash
+uv pip install 'scale-vero[harbor]'
+vero harbor build -c build.yaml -o /tmp/opt-task        # compile a Harbor task
+vero harbor run   -c build.yaml -a claude-code -m claude-haiku-4-5 -e docker   # build + run
+```
+
+Two evaluation modes: **Mode A** (vero runs inference + scoring against vero-side labels) and **Mode B** (evaluation is delegated to a *nested* `harbor run`, e.g. on Modal). See:
+
+- [`docs/harbor/architecture.md`](docs/harbor/architecture.md) — what it is, the topology, and the leaderboard-integrity model.
+- [`docs/harbor/tutorial.md`](docs/harbor/tutorial.md) — build and run a task end to end.
+- [`examples/gsm8k-agent`](examples/gsm8k-agent) (Mode A) and [`examples/gaia-optimization`](examples/gaia-optimization) (Mode B).
+
 ## Examples
 
 See [`examples/matmul-kernel/`](examples/matmul-kernel/) for a complete runnable example that optimizes a matrix multiply kernel for speed. It demonstrates eval-only mode, full optimization with VeroAgent or Claude Code, filesystem artifacts, and resource-based editing.
diff --git a/vero/docs/harbor/architecture.md b/vero/docs/harbor/architecture.md
@@ -0,0 +1,121 @@
+# Harbor integration — architecture
+
+The Harbor integration turns a **vero optimization run into a [Harbor](https://www.harborframework.com)
+task**. The agent-under-test of that Harbor task is an *optimizer*: any Harbor agent
+(Claude Code, an oracle script, …) edits a target repository and spends an evaluation
+budget; the reward is the best candidate's score on a hidden test split.
+
+This lets anyone optimize a coding agent with plain `harbor run`, and makes the result
+leaderboard-gradeable — the optimizer cannot read hidden labels, modify the scorer, or
+bypass its budget.
+
+```
+harbor run -p <task> -a <optimizer> -m <model> -e <provider>
+        │
+        ▼  one optimization trial (a Docker Compose environment):
+  ┌────────────────────────┐        ┌────────────────────────────────────┐
+  │ main (optimizer bench)  │  HTTP  │ eval-sidecar (the evaluation engine) │
+  │  • target repo (rw*)    │ ─────► │  • dataset + scorer + baseline repo  │
+  │  • `vero harbor` client │        │  • budget ledger + creds             │
+  │  • runs the -a optimizer│        │  • `vero harbor serve` (FastAPI)     │
+  └────────────────────────┘        └────────────────────────────────────┘
+        │  (trial end, shared verifier)            ▲
+        └── `vero harbor finalize` (admin token) ──┘ → /logs/verifier/reward.json
+```
+
+## The optimization loop
+
+1. **`vero harbor build`** compiles a `build.yaml` into a Harbor task directory
+   (`environment/` compose + Dockerfiles, `instruction.md`, `tests/test.sh`), baking
+   the dataset, scorer, baseline repo, and a `ServeConfig`.
+2. At trial start, **`main`** seeds the target repo onto a shared volume and applies
+   write-access rules; the **eval-sidecar** starts `vero harbor serve` and writes a
+   per-trial admin token.
+3. The **optimizer** (the `-a` agent) edits the repo, commits, and calls
+   `vero harbor eval --split <train|validation>` to measure a commit. The sidecar
+   fetches that commit, evaluates it (metered against the budget), and returns an
+   **aggregate** score (never per-sample labels).
+4. At trial end, Harbor runs `tests/test.sh` in `main` (shared verifier mode). It
+   reads the admin token and calls the sidecar's **`finalize`**: the sidecar selects
+   the winning commit and scores it on the **hidden** test split, producing the reward.
+
+## Two evaluation modes
+
+The seam is a single injection point on the `Evaluator` (`eval_strategy`):
+
+- **Mode A — vero scores** (`task_project`/`task` + dataset). vero runs the agent's
+  inference and a vero scoring function against vero-side labels. Example:
+  [`examples/gsm8k-agent`](../../examples/gsm8k-agent).
+- **Mode B — Harbor scores** (`HarborConfig`). Inference is delegated: for each
+  candidate, `HarborRunner` runs a *nested* `harbor run` of the agent on a set of
+  Harbor tasks (e.g. on Modal) and collates the verifier rewards. One Harbor task =
+  one sample. Example: [`examples/gaia-optimization`](../../examples/gaia-optimization).
+
+Both modes share the same topology, trust boundary, budget, and verifier — only the
+"produce sample results" step differs.
+
+## Leaderboard integrity (the trust boundary)
+
+The optimizer is untrusted. Integrity rests on a few mechanisms, all best-effort at
+the OS/process level (a container escape is out of scope):
+
+- **3-tier split visibility** (`SplitAccessLevel`): `visible` (aggregate + per-sample
+  results), `non_viewable` (aggregate score only — no labels), `no_access` (hidden;
+  never evaluable by the agent, never written to its volume).
+- **Write-routing by tier**: the sidecar writes only the agent-permitted projection of
+  each result to the *agent-results* volume (read-only in `main`). Full results, the
+  dataset, the ledger, and creds live on the *admin* volume, **never** mounted to `main`.
+- **Token-gated finalize**: `finalize` (selection + hidden-split scoring) requires an
+  admin token written `root:600` on a volume `main` mounts read-only. The optimizer
+  runs as a de-privileged user and cannot read it, so it cannot trigger scoring or
+  probe the test split; the verifier (root, shared mode) can.
+- **Metered budget**: a durable `BudgetLedger` caps how much the agent can evaluate per
+  split. Admin (verifier) evaluations bypass the meter.
+- **Commit transfer**: the sidecar `git fetch`es the agent's commit from the mounted
+  repo into its *own* repo with hooks disabled and `file://` (object copy, no
+  alternates), so the evaluated tree is fully owned by the sidecar and tamper-evident.
+- **Protected scorer / write-access**: the scorer is sidecar-only; `read_only_paths`
+  in `build.yaml` are applied as unix perms in `main` before the optimizer runs.
+
+### Why a sidecar + shared verifier
+
+The evaluation engine, dataset, scorer, and creds live in a separate container so the
+optimizer never shares a filesystem or process space with them. We use Harbor's
+**shared verifier** (the env, including the sidecar, stays up during `tests/test.sh`)
+so the verifier can reach the live engine over HTTP and stay the single source of
+truth — avoiding shipping the repo/dataset/ledger into a fresh verifier container. The
+agent/admin split is enforced by the `root:600` token rather than separate services.
+
+## Component map
+
+```
+vero/harbor/
+├── build/            `vero harbor build`: BuildConfig → Harbor task dir
+│   ├── config.py       BuildConfig (the build.yaml schema)
+│   ├── compiler.py     renders the task dir; bakes dataset/scorer/repo/ServeConfig
+│   └── templates/      compose, two Dockerfiles, instruction.md, test.sh, seed.sh, solve.sh
+├── serve.py          `vero harbor serve`: assemble engine+sidecar+verifier from ServeConfig
+├── app.py            FastAPI surface: /eval /submit /status (agent), /finalize (admin)
+├── server.py         EvaluationSidecar: commit transfer + tier write-routing (transport-agnostic)
+├── verifier.py       Verifier: commit selection (submit | auto_best) + hidden-split scoring
+├── auth.py           per-trial admin token (generate / root:600 write / verify)
+├── cli.py            `vero harbor` group: build | run | serve | eval | submit | status | finalize
+├── config.py         HarborConfig (Mode B)
+├── runner.py         HarborRunner (Mode-B EvalStrategy): nested `harbor run` → collate
+├── dataset.py        Mode-B {split: [task_names]} partition → DatasetDict
+└── protocol.py       aggregate-safe wire types + the redaction of an Experiment
+
+vero/evaluation/
+├── engine.py         EvaluationEngine: budget metering + the single evaluate() entry point
+├── evaluator.py      Evaluator: checkout + run; the eval_strategy seam (Mode A vs B)
+└── strategy.py       EvalStrategy protocol
+```
+
+The compiler↔sidecar contract is `ServeConfig` (baked as `environment/sidecar/serve.json`);
+the optimizer↔sidecar contract is the HTTP API in `app.py` (+ the `vero harbor` CLI clients).
+
+## See also
+
+- [Tutorial](./tutorial.md) — build and run an optimization task end to end.
+- [`examples/gsm8k-agent`](../../examples/gsm8k-agent) — Mode A.
+- [`examples/gaia-optimization`](../../examples/gaia-optimization) — Mode B (nested Harbor on Modal).
diff --git a/vero/docs/harbor/tutorial.md b/vero/docs/harbor/tutorial.md
@@ -0,0 +1,134 @@
+# Harbor integration — tutorial
+
+This walks through compiling a vero optimization run into a Harbor task and running it
+with an optimizer agent. Read the [architecture](./architecture.md) first for the
+concepts (modes, the trust boundary, the optimization loop).
+
+## Install
+
+```bash
+uv pip install 'scale-vero[harbor]'          # adds the `vero harbor` CLI
+# the Harbor CLI itself is invoked via uvx; for Modal-backed inner runs use the extra:
+uvx --from 'harbor[modal]' harbor --help
+```
+
+## 1. Write a `build.yaml`
+
+A build config describes the optimization task: the repo to optimize, how candidates
+are scored, the split tiers, the budget, and the reward.
+
+### Mode A — vero runs inference + scoring
+
+```yaml
+name: myorg/gsm8k-opt
+agent_repo: /path/to/gsm8k-agent     # the repo the optimizer edits
+mode: A
+task: gsm8k                          # vero task name
+task_module: gsm8k_agent.vero_tasks  # module that registers it
+dataset: /path/to/gsm8k-dataset      # a saved DatasetDict (inputs + labels)
+
+splits:
+  - { split: validation, access: non_viewable }   # optimizer sees aggregate score only
+  - { split: test,       access: no_access }       # hidden; scored at the end
+budgets:
+  - { split: validation, total_run_budget: 5 }
+reward_mode: auto_best                # best validation commit auto-selected
+selection_split: validation
+targets:
+  - { split: test, reward_key: reward }
+read_only_paths:
+  - src/gsm8k_agent/vero_tasks        # the scorer — optimizer may not edit it
+secrets: [OPENAI_API_KEY, OPENAI_BASE_URL]   # injected into the eval sidecar only
+```
+
+### Mode B — a nested `harbor run` scores (e.g. on Modal)
+
+```yaml
+name: myorg/gaia-opt
+agent_repo: /path/to/gaia-agent
+mode: B
+harbor:
+  agent_import_path: "gaia_agent:GaiaAgent"   # the agent inside agent_repo
+  task_source: gaia/gaia                       # Harbor registry benchmark (or a local dir)
+  environment: modal
+  model: openai/gpt-4o-mini                    # the inner agent's model
+partition:                                     # {split: [harbor task names]} — one task = one sample
+  train:      [gaia/<id1>, gaia/<id2>, ...]
+  validation: [gaia/<id6>, gaia/<id7>, ...]
+splits:
+  - { split: train,      access: non_viewable }
+  - { split: validation, access: no_access }
+budgets:
+  - { split: train, total_run_budget: 3 }
+reward_mode: auto_best
+selection_split: train
+targets:
+  - { split: validation, reward_key: accuracy }
+secrets: [OPENAI_API_KEY, OPENAI_BASE_URL, MODAL_TOKEN_ID, MODAL_TOKEN_SECRET]
+```
+
+`secrets` are variable **names**: their values are read from your shell at run time and
+injected into the eval sidecar only — never into the optimizer's container. The full
+field list is in `vero/harbor/build/config.py` (`BuildConfig`).
+
+## 2. Build the task
+
+```bash
+vero harbor build -c build.yaml -o /tmp/opt-task
+```
+
+This emits a Harbor task directory: `environment/` (a Docker Compose env = the optimizer
+workbench `main` + the `eval-sidecar`, plus volumes), `instruction.md` (the protocol the
+optimizer reads), and `tests/test.sh` (the verifier). The dataset/scorer/baseline repo
+and the sidecar's `ServeConfig` are baked in.
+
+## 3. Run it with an optimizer
+
+Any Harbor agent can be the optimizer. Provide its creds in your shell (Harbor forwards
+them into `main`); e.g. for `claude-code` set `ANTHROPIC_API_KEY` (+ `ANTHROPIC_BASE_URL`
+if routing through a gateway).
+
+```bash
+# build + run in one step:
+vero harbor run -c build.yaml -a claude-code -m claude-haiku-4-5 -e docker
+
+# or run a pre-built task dir:
+uvx harbor run -p /tmp/opt-task -a claude-code -m claude-haiku-4-5 -e docker
+
+# the `oracle` agent runs solution/solve.sh (a scripted optimizer) — handy for a smoke test:
+uvx harbor run -p /tmp/opt-task -a oracle -e docker
+```
+
+The reward lands in the job's `verifier/reward.json` (e.g. `{"reward": 0.42}`), and Harbor
+reports it as the trial reward.
+
+## What the optimizer does (the agent-side protocol)
+
+Inside `main`, the optimizer follows `instruction.md`. The `vero harbor` CLI talks to the
+eval sidecar over `VERO_EVAL_URL` (set automatically):
+
+```bash
+vero harbor status                                   # remaining budget, evaluable splits
+# edit the repo, commit, then measure the current HEAD:
+vero harbor eval --dataset-id <id> --split validation
+vero harbor submit                                   # (if reward_mode: submit) nominate the final commit
+```
+
+- `eval` returns an aggregate score + remaining budget; for `no_access` splits it is
+  rejected, and labels are never returned.
+- With `reward_mode: auto_best`, the best commit on `selection_split` is chosen
+  automatically; with `submit`, the agent nominates one.
+- The verifier scores the chosen commit on the hidden `targets` split at the end.
+
+## Inspecting a run
+
+```bash
+uvx harbor view <jobs-dir>          # browse trials
+cat <jobs-dir>/*/*/verifier/reward.json
+```
+
+## Examples
+
+- [`examples/gsm8k-agent`](../../examples/gsm8k-agent) — Mode A (vero scores gsm8k).
+- [`examples/gaia-optimization`](../../examples/gaia-optimization) — Mode B (terminus on
+  GAIA via nested Harbor on Modal), with an editable-prompt optimization surface.
diff --git a/vero/examples/gaia-optimization/README.md b/vero/examples/gaia-optimization/README.md
@@ -0,0 +1,79 @@
+# GAIA optimization example (Harbor Mode B)
+
+This example shows the **vero ⇄ Harbor** integration optimizing a coding agent on a
+real benchmark. An optimizer (e.g. Claude Code) edits a GAIA agent's prompt; each
+candidate is scored by a **nested `harbor run`** of the agent on real
+[GAIA](https://huggingface.co/datasets/gaia-benchmark/GAIA) tasks (on Modal). The
+reward is accuracy on a hidden split.
+
+This is "Mode B": vero does **no** inference itself — evaluation is delegated to a
+nested Harbor run, and the reward comes from Harbor's verifier. (Contrast "Mode A",
+e.g. [`../gsm8k-agent`](../gsm8k-agent), where vero runs inference and scoring directly.)
+
+## What's here
+
+```
+gaia-optimization/
+├── build.yaml                         # the optimization task definition (vero harbor build -c)
+├── pyproject.toml                     # deps: harbor[modal]
+└── src/gaia_agent/
+    ├── agent.py                       # GaiaAgent(Terminus2): the editable agent
+    └── prompts/                       # the OPTIMIZATION SURFACE — the optimizer edits these
+        ├── terminus-json-plain.txt
+        └── terminus-xml-plain.txt
+```
+
+`GaiaAgent` subclasses Harbor's `Terminus2` and overrides only its prompt-template
+path so the prompt is read from this package's editable `prompts/` directory. The
+optimizer improves `prompts/terminus-json-plain.txt`; the terminal loop, tmux
+session, and response parsing are reused from `Terminus2` unchanged.
+
+## Prerequisites
+
+- The `harbor` CLI (`uvx --from 'harbor[modal]' harbor ...`) and Docker (outer trial).
+- A [Modal](https://modal.com) account for the inner GAIA runs:
+  `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` in your shell env.
+- An OpenAI-compatible LLM endpoint for the **inner** GAIA agent:
+  `OPENAI_API_KEY` (+ optional `OPENAI_BASE_URL` to point at a gateway). The model is
+  set in `build.yaml` (`harbor.model`, default `openai/gpt-4o-mini`).
+- Creds for the **outer** optimizer agent, per that agent (e.g. `ANTHROPIC_API_KEY`
+  for `-a claude-code`). Harbor forwards these from your shell into the optimizer's
+  container; they are **not** shared with the eval sidecar.
+
+Secrets are resolved from your shell at run time and injected into the eval sidecar
+**only** (see `build.yaml`'s `secrets:` — those are variable *names*, not values).
+
+## Run it
+
+```bash
+# install vero with the harbor extra
+uv pip install 'scale-vero[harbor]'
+
+# build the task, then run it with an optimizer of your choice
+vero harbor build -c build.yaml -o /tmp/gaia-task
+uvx harbor run -p /tmp/gaia-task -a claude-code -m claude-haiku-4-5 -e docker
+
+# ...or build + run in one step:
+vero harbor run -c build.yaml -a claude-code -m claude-haiku-4-5 -e docker
+```
+
+The optimizer reads the task instruction, edits `src/gaia_agent/prompts/...`, commits,
+and calls `vero harbor eval --split train` to measure candidates within its budget.
+At the end, the best train commit is scored on the hidden `validation` split and the
+accuracy is written to Harbor's `reward.json`.
+
+## Notes
+
+- **GAIA is hard.** A terminal agent solves only some tasks; expect low scores and
+  weak optimization signal on a 5-task subset. Increase the subset, pick easier tasks,
+  or use a stronger model for a more meaningful run.
+- **Cost/time.** Each GAIA task is a full agent rollout on a Modal sandbox (minutes +
+  LLM tokens). The default budget keeps a run to a handful of nested evals.
+- Pick your own task ids by enumerating the benchmark:
+  `python -c "import asyncio; from harbor.models.job.config import DatasetConfig as D; print(asyncio.run(D(name='gaia/gaia').get_task_configs()))"`
+
+## Attribution
+
+`src/gaia_agent/prompts/*.txt` are copied from Harbor's `terminus_2` agent
+(© Harbor authors, Apache-2.0) so the prompt stays compatible with the parser
+`GaiaAgent` inherits. They are included here as the editable optimization surface.