2 changes: 2 additions & 0 deletions vero/.gitignore
@@ -11,6 +11,8 @@ __pycache__/
*.egg-info/
dist/
build/
# ...but the harbor compiler package is source, not a packaging artifact:
!src/vero/harbor/build/

# Testing
.pytest_cache/
16 changes: 16 additions & 0 deletions vero/README.md
@@ -525,6 +525,22 @@ agent = VeroAgent(
)
```

## Harbor integration

vero can compile an optimization run into a [Harbor](https://www.harborframework.com) task, so that the *optimizer* itself becomes the Harbor agent-under-test. Any Harbor agent (Claude Code, an oracle script, …) edits a target repo and spends an evaluation budget, and the reward is the best candidate's score on a hidden split. This makes optimization runs reproducible and leaderboard-gradeable: the optimizer can't read hidden labels, modify the scorer, or bypass its budget.

```bash
uv pip install 'scale-vero[harbor]'
vero harbor build -c build.yaml -o /tmp/opt-task # compile a Harbor task
vero harbor run -c build.yaml -a claude-code -m claude-haiku-4-5 -e docker # build + run
```

There are two evaluation modes: **Mode A** (vero runs inference + scoring against vero-side labels) and **Mode B** (evaluation is delegated to a *nested* `harbor run`, e.g. on Modal). See:

- [`docs/harbor/architecture.md`](docs/harbor/architecture.md) — what it is, the topology, and the leaderboard-integrity model.
- [`docs/harbor/tutorial.md`](docs/harbor/tutorial.md) — build and run a task end to end.
- [`examples/gsm8k-agent`](examples/gsm8k-agent) (Mode A) and [`examples/gaia-optimization`](examples/gaia-optimization) (Mode B).

## Examples

See [`examples/matmul-kernel/`](examples/matmul-kernel/) for a complete runnable example that optimizes a matrix multiply kernel for speed. It demonstrates eval-only mode, full optimization with VeroAgent or Claude Code, filesystem artifacts, and resource-based editing.
121 changes: 121 additions & 0 deletions vero/docs/harbor/architecture.md
@@ -0,0 +1,121 @@
# Harbor integration — architecture

The Harbor integration turns a **vero optimization run into a [Harbor](https://www.harborframework.com)
task**. The agent-under-test of that Harbor task is an *optimizer*: any Harbor agent
(Claude Code, an oracle script, …) edits a target repository and spends an evaluation
budget; the reward is the best candidate's score on a hidden test split.

This lets anyone optimize a coding agent with plain `harbor run`, and makes the result
leaderboard-gradeable — the optimizer cannot read hidden labels, modify the scorer, or
bypass its budget.

```
harbor run -p <task> -a <optimizer> -m <model> -e <provider>
            ▼  one optimization trial (a Docker Compose environment):
┌─────────────────────────┐        ┌──────────────────────────────────────┐
│ main (optimizer bench)  │  HTTP  │ eval-sidecar (the evaluation engine) │
│ • target repo (rw*)     │ ─────► │ • dataset + scorer + baseline repo   │
│ • `vero harbor` client  │        │ • budget ledger + creds              │
│ • runs the -a optimizer │        │ • `vero harbor serve` (FastAPI)      │
└─────────────────────────┘        └──────────────────────────────────────┘
            │ (trial end, shared verifier)             ▲
            └── `vero harbor finalize` (admin token) ──┘ → /logs/verifier/reward.json
```

## The optimization loop

1. **`vero harbor build`** compiles a `build.yaml` into a Harbor task directory
(`environment/` compose + Dockerfiles, `instruction.md`, `tests/test.sh`), baking
the dataset, scorer, baseline repo, and a `ServeConfig`.
2. At trial start, **`main`** seeds the target repo onto a shared volume and applies
write-access rules; the **eval-sidecar** starts `vero harbor serve` and writes a
per-trial admin token.
3. The **optimizer** (the `-a` agent) edits the repo, commits, and calls
`vero harbor eval --split <train|validation>` to measure a commit. The sidecar
fetches that commit, evaluates it (metered against the budget), and returns an
**aggregate** score (never per-sample labels).
4. At trial end, Harbor runs `tests/test.sh` in `main` (shared verifier mode). It
reads the admin token and calls the sidecar's **`finalize`**: the sidecar selects
the winning commit and scores it on the **hidden** test split, producing the reward.
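
The metering in step 3 can be pictured with a toy in-memory model. This is a minimal sketch under stated assumptions, not vero's implementation: `ToyLedger`, `ToySidecar`, `BudgetExhausted`, and the response keys are hypothetical names, and the real `BudgetLedger` is described as durable whereas this one lives in memory.

```python
from dataclasses import dataclass, field


class BudgetExhausted(Exception):
    """Raised when a split has no evaluation runs left (hypothetical name)."""


@dataclass
class ToyLedger:
    # Remaining evaluation runs per split; an in-memory stand-in for a ledger.
    remaining: dict

    def charge(self, split: str) -> int:
        left = self.remaining.get(split, 0)
        if left <= 0:
            raise BudgetExhausted(split)
        self.remaining[split] = left - 1
        return self.remaining[split]


@dataclass
class ToySidecar:
    ledger: ToyLedger
    history: list = field(default_factory=list)

    def eval(self, commit: str, split: str, score_fn) -> dict:
        left = self.ledger.charge(split)  # charge first: a rejected call costs nothing
        score = score_fn(commit)          # aggregate score only, no per-sample labels
        self.history.append((commit, split, score))
        return {"score": score, "remaining": left}


# Two runs fit the budget; the third is rejected before any scoring happens.
sidecar = ToySidecar(ToyLedger({"validation": 2}))
fake_scores = {"c1": 0.40, "c2": 0.55}
first = sidecar.eval("c1", "validation", fake_scores.get)
second = sidecar.eval("c2", "validation", fake_scores.get)
try:
    sidecar.eval("c3", "validation", fake_scores.get)
    third_rejected = False
except BudgetExhausted:
    third_rejected = True
```

Charging the ledger before scoring is a design choice of this sketch: it guarantees that an over-budget request never reaches the scorer.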

## Two evaluation modes

The two modes differ at a single injection point on the `Evaluator` (`eval_strategy`):

- **Mode A — vero scores** (`task_project`/`task` + dataset). vero runs the agent's
inference and a vero scoring function against vero-side labels. Example:
[`examples/gsm8k-agent`](../../examples/gsm8k-agent).
- **Mode B — Harbor scores** (`HarborConfig`). Inference is delegated: for each
candidate, `HarborRunner` runs a *nested* `harbor run` of the agent on a set of
Harbor tasks (e.g. on Modal) and collates the verifier rewards. One Harbor task =
one sample. Example: [`examples/gaia-optimization`](../../examples/gaia-optimization).
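
A rough picture of that seam, as a sketch rather than vero's actual interface: only the attribute name `eval_strategy` comes from the text above. The `Toy*` classes, the `produce_results` method, and the mean aggregation are assumptions made for illustration.

```python
from typing import Callable, Protocol, Sequence


class ToyEvalStrategy(Protocol):
    """Hypothetical seam: turn a commit + split into per-sample scores."""

    def produce_results(self, commit: str, split: str) -> Sequence[float]: ...


class ToyModeA:
    """Mode A flavour: run inference locally and score against local labels."""

    def __init__(self, predict: Callable, labels: dict):
        self.predict = predict
        self.labels = labels

    def produce_results(self, commit, split):
        return [1.0 if self.predict(commit, x) == y else 0.0
                for x, y in self.labels[split]]


class ToyModeB:
    """Mode B flavour: delegate to a nested run and collate its rewards."""

    def __init__(self, run_nested: Callable):
        self.run_nested = run_nested

    def produce_results(self, commit, split):
        return list(self.run_nested(commit, split))


class ToyEvaluator:
    """Everything except the injected strategy is shared between modes."""

    def __init__(self, eval_strategy: ToyEvalStrategy):
        self.eval_strategy = eval_strategy

    def evaluate(self, commit: str, split: str) -> float:
        results = self.eval_strategy.produce_results(commit, split)
        return sum(results) / len(results) if results else 0.0


mode_a = ToyEvaluator(ToyModeA(lambda commit, x: x * 2,
                               {"validation": [(1, 2), (2, 5)]}))
mode_b = ToyEvaluator(ToyModeB(lambda commit, split: [1.0, 0.0, 1.0, 1.0]))
```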

Both modes share the same topology, trust boundary, budget, and verifier — only the
"produce sample results" step differs.

## Leaderboard integrity (the trust boundary)

The optimizer is untrusted. Integrity rests on a few mechanisms, all best-effort at
the OS/process level (a container escape is out of scope):

- **3-tier split visibility** (`SplitAccessLevel`): `visible` (aggregate + per-sample
results), `non_viewable` (aggregate score only — no labels), `no_access` (hidden;
never evaluable by the agent, never written to its volume).
- **Write-routing by tier**: the sidecar writes only the agent-permitted projection of
each result to the *agent-results* volume (read-only in `main`). Full results, the
dataset, the ledger, and creds live on the *admin* volume, **never** mounted to `main`.
- **Token-gated finalize**: `finalize` (selection + hidden-split scoring) requires an
admin token written `root:600` (owned by root, mode `0600`) on a volume that `main`
mounts read-only. The optimizer runs as a de-privileged user and cannot read it, so it
cannot trigger scoring or probe the test split; the verifier (root, shared mode) can.
- **Metered budget**: a durable `BudgetLedger` caps how much the agent can evaluate per
split. Admin (verifier) evaluations bypass the meter.
- **Commit transfer**: the sidecar `git fetch`es the agent's commit from the mounted
repo into its *own* repo over `file://` with hooks disabled (objects are copied, no
alternates), so the evaluated tree is fully owned by the sidecar and tamper-evident.
- **Protected scorer / write-access**: the scorer is sidecar-only; `read_only_paths`
in `build.yaml` are applied as unix perms in `main` before the optimizer runs.
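
The three visibility tiers amount to a projection applied before anything is written where the agent can read it. The sketch below is an illustration under assumptions: the tier values mirror the names listed above, but the `Access` enum, `project_for_agent`, and the result keys (`aggregate`, `per_sample`) are hypothetical and not taken from vero's code.

```python
from enum import Enum
from typing import Optional


class Access(str, Enum):
    # Values mirror the tier names in the list above.
    VISIBLE = "visible"
    NON_VIEWABLE = "non_viewable"
    NO_ACCESS = "no_access"


def project_for_agent(result: dict, access: Access) -> Optional[dict]:
    """Return the part of a result the agent may see, or None for nothing."""
    if access is Access.NO_ACCESS:
        return None  # hidden split: nothing reaches the agent's volume
    if access is Access.NON_VIEWABLE:
        return {"aggregate": result["aggregate"]}  # score only, no samples
    return dict(result)  # visible: aggregate plus per-sample results


full = {
    "aggregate": 0.5,
    "per_sample": [{"id": "a", "score": 1.0}, {"id": "b", "score": 0.0}],
}
visible_view = project_for_agent(full, Access.VISIBLE)
redacted_view = project_for_agent(full, Access.NON_VIEWABLE)
hidden_view = project_for_agent(full, Access.NO_ACCESS)
```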

### Why a sidecar + shared verifier

The evaluation engine, dataset, scorer, and creds live in a separate container so the
optimizer never shares a filesystem or process space with them. We use Harbor's
**shared verifier** (the env, including the sidecar, stays up during `tests/test.sh`)
so the verifier can reach the live engine over HTTP and stay the single source of
truth — avoiding shipping the repo/dataset/ledger into a fresh verifier container. The
agent/admin split is enforced by the `root:600` token rather than separate services.
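
The token mechanism relies on ordinary POSIX file permissions. A minimal sketch of the idea, assuming a POSIX filesystem: `write_admin_token` and `verify_admin_token` are hypothetical helpers, not the functions in `auth.py`, and the sketch does not change file ownership (the `root` part of `root:600` comes from the writing process running as root).

```python
import hmac
import os
import secrets
import stat
import tempfile


def write_admin_token(path: str) -> str:
    """Create a fresh token file readable and writable by its owner only."""
    token = secrets.token_urlsafe(32)
    # O_EXCL refuses to reuse an existing file; 0o600 = owner read/write only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(token)
    os.chmod(path, 0o600)  # re-apply the mode in case the umask masked bits
    return token


def verify_admin_token(path: str, presented: str) -> bool:
    """Compare in constant time to avoid leaking the token through timing."""
    with open(path) as fh:
        expected = fh.read().strip()
    return hmac.compare_digest(expected.encode(), presented.encode())


with tempfile.TemporaryDirectory() as tmp:
    token_path = os.path.join(tmp, "admin.token")
    issued = write_admin_token(token_path)
    mode = stat.S_IMODE(os.stat(token_path).st_mode)
    accepted = verify_admin_token(token_path, issued)
    rejected = verify_admin_token(token_path, "not-the-token")
```

A process running as a different, unprivileged user gets a permission error when it tries to open such a file, which is the property the agent/admin split depends on.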

## Component map

```
vero/harbor/
├── build/            `vero harbor build`: BuildConfig → Harbor task dir
│   ├── config.py     BuildConfig (the build.yaml schema)
│   ├── compiler.py   renders the task dir; bakes dataset/scorer/repo/ServeConfig
│   └── templates/    compose, two Dockerfiles, instruction.md, test.sh, seed.sh, solve.sh
├── serve.py          `vero harbor serve`: assemble engine+sidecar+verifier from ServeConfig
├── app.py            FastAPI surface: /eval /submit /status (agent), /finalize (admin)
├── server.py         EvaluationSidecar: commit transfer + tier write-routing (transport-agnostic)
├── verifier.py       Verifier: commit selection (submit | auto_best) + hidden-split scoring
├── auth.py           per-trial admin token (generate / root:600 write / verify)
├── cli.py            `vero harbor` group: build | run | serve | eval | submit | status | finalize
├── config.py         HarborConfig (Mode B)
├── runner.py         HarborRunner (Mode-B EvalStrategy): nested `harbor run` → collate
├── dataset.py        Mode-B {split: [task_names]} partition → DatasetDict
└── protocol.py       aggregate-safe wire types + the redaction of an Experiment

vero/evaluation/
├── engine.py         EvaluationEngine: budget metering + the single evaluate() entry point
├── evaluator.py      Evaluator: checkout + run; the eval_strategy seam (Mode A vs B)
└── strategy.py       EvalStrategy protocol
```

The compiler↔sidecar contract is `ServeConfig` (baked as `environment/sidecar/serve.json`);
the optimizer↔sidecar contract is the HTTP API in `app.py` (+ the `vero harbor` CLI clients).

## See also

- [Tutorial](./tutorial.md) — build and run an optimization task end to end.
- [`examples/gsm8k-agent`](../../examples/gsm8k-agent) — Mode A.
- [`examples/gaia-optimization`](../../examples/gaia-optimization) — Mode B (nested Harbor on Modal).
134 changes: 134 additions & 0 deletions vero/docs/harbor/tutorial.md
@@ -0,0 +1,134 @@
# Harbor integration — tutorial

This walks through compiling a vero optimization run into a Harbor task and running it
with an optimizer agent. Read the [architecture](./architecture.md) first for the
concepts (modes, the trust boundary, the optimization loop).

## Install

```bash
uv pip install 'scale-vero[harbor]' # adds the `vero harbor` CLI
# the Harbor CLI itself is invoked via uvx; for Modal-backed inner runs use the extra:
uvx --from 'harbor[modal]' harbor --help
```

## 1. Write a `build.yaml`

A build config describes the optimization task: the repo to optimize, how candidates
are scored, the split tiers, the budget, and the reward.

### Mode A — vero runs inference + scoring

```yaml
name: myorg/gsm8k-opt
agent_repo: /path/to/gsm8k-agent # the repo the optimizer edits
mode: A
task: gsm8k # vero task name
task_module: gsm8k_agent.vero_tasks # module that registers it
dataset: /path/to/gsm8k-dataset # a saved DatasetDict (inputs + labels)

splits:
- { split: validation, access: non_viewable } # optimizer sees aggregate score only
- { split: test, access: no_access } # hidden; scored at the end
budgets:
- { split: validation, total_run_budget: 5 }
reward_mode: auto_best # best validation commit auto-selected
selection_split: validation
targets:
- { split: test, reward_key: reward }
read_only_paths:
- src/gsm8k_agent/vero_tasks # the scorer — optimizer may not edit it
secrets: [OPENAI_API_KEY, OPENAI_BASE_URL] # injected into the eval sidecar only
```

### Mode B — a nested `harbor run` scores (e.g. on Modal)

```yaml
name: myorg/gaia-opt
agent_repo: /path/to/gaia-agent
mode: B
harbor:
  agent_import_path: "gaia_agent:GaiaAgent"   # the agent inside agent_repo
  task_source: gaia/gaia                       # Harbor registry benchmark (or a local dir)
  environment: modal
  model: openai/gpt-4o-mini                    # the inner agent's model
  partition:                                   # {split: [harbor task names]}; one task = one sample
    train: [gaia/<id1>, gaia/<id2>, ...]
    validation: [gaia/<id6>, gaia/<id7>, ...]
splits:
- { split: train, access: non_viewable }
- { split: validation, access: no_access }
budgets:
- { split: train, total_run_budget: 3 }
reward_mode: auto_best
selection_split: train
targets:
- { split: validation, reward_key: accuracy }
secrets: [OPENAI_API_KEY, OPENAI_BASE_URL, MODAL_TOKEN_ID, MODAL_TOKEN_SECRET]
```

`secrets` are variable **names**: their values are read from your shell at run time and
injected into the eval sidecar only — never into the optimizer's container. The full
field list is in `vero/harbor/build/config.py` (`BuildConfig`).
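
Resolving secrets by name can be sketched as a lookup against the process environment. This is an illustration only: `resolve_secrets` is a hypothetical helper, and failing loudly on a missing variable is an assumption about desirable behavior, not a statement of what vero does.

```python
import os


def resolve_secrets(names, environ=None):
    """Map each declared secret name to its value from the environment."""
    env = os.environ if environ is None else environ
    missing = [name for name in names if not env.get(name)]
    if missing:
        # Assumed behavior: stop early rather than start a run without creds.
        raise KeyError("missing secrets in environment: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Only the declared names are picked up; unrelated variables are ignored.
fake_env = {"OPENAI_API_KEY": "sk-test", "UNRELATED": "x"}
resolved = resolve_secrets(["OPENAI_API_KEY"], fake_env)

try:
    resolve_secrets(["OPENAI_API_KEY", "MODAL_TOKEN_ID"], fake_env)
    missing_detected = False
except KeyError:
    missing_detected = True
```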

## 2. Build the task

```bash
vero harbor build -c build.yaml -o /tmp/opt-task
```

This emits a Harbor task directory: `environment/` (a Docker Compose env = the optimizer
workbench `main` + the `eval-sidecar`, plus volumes), `instruction.md` (the protocol the
optimizer reads), and `tests/test.sh` (the verifier). The dataset/scorer/baseline repo
and the sidecar's `ServeConfig` are baked in.

## 3. Run it with an optimizer

Any Harbor agent can be the optimizer. Provide its creds in your shell (Harbor forwards
them into `main`); e.g. for `claude-code` set `ANTHROPIC_API_KEY` (+ `ANTHROPIC_BASE_URL`
if routing through a gateway).

```bash
# build + run in one step:
vero harbor run -c build.yaml -a claude-code -m claude-haiku-4-5 -e docker

# or run a pre-built task dir:
uvx harbor run -p /tmp/opt-task -a claude-code -m claude-haiku-4-5 -e docker

# the `oracle` agent runs solution/solve.sh (a scripted optimizer) — handy for a smoke test:
uvx harbor run -p /tmp/opt-task -a oracle -e docker
```

The reward lands in the job's `verifier/reward.json` (e.g. `{"reward": 0.42}`), and Harbor
reports it as the trial reward.

## What the optimizer does (the agent-side protocol)

Inside `main`, the optimizer follows `instruction.md`. The `vero harbor` CLI talks to the
eval sidecar over `VERO_EVAL_URL` (set automatically):

```bash
vero harbor status # remaining budget, evaluable splits
# edit the repo, commit, then measure the current HEAD:
vero harbor eval --dataset-id <id> --split validation
vero harbor submit # (if reward_mode: submit) nominate the final commit
```

- `eval` returns an aggregate score + remaining budget; for `no_access` splits it is
rejected, and labels are never returned.
- With `reward_mode: auto_best`, the best commit on `selection_split` is chosen
automatically; with `submit`, the agent nominates one.
- The verifier scores the chosen commit on the hidden `targets` split at the end.
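
The two `reward_mode` values can be read as two selection rules over the evaluation history. The following is a sketch under assumptions: `select_commit` and the `(commit, split, score)` history shape are hypothetical, and breaking ties in favor of the earliest evaluated commit is a choice made here, not something the documentation specifies.

```python
def select_commit(history, reward_mode, selection_split=None, submitted=None):
    """Pick the commit to score on the hidden split.

    history: list of (commit, split, score) tuples in evaluation order.
    """
    if reward_mode == "submit":
        return submitted  # the agent's own nomination (None if it never submitted)
    if reward_mode == "auto_best":
        candidates = [item for item in history if item[1] == selection_split]
        if not candidates:
            return None
        # max() returns the first maximal element, so earlier commits win ties.
        return max(candidates, key=lambda item: item[2])[0]
    raise ValueError(f"unknown reward_mode: {reward_mode!r}")


history = [
    ("c1", "validation", 0.4),
    ("c2", "validation", 0.7),
    ("c3", "train", 0.9),       # ignored: not the selection split
    ("c4", "validation", 0.7),  # ties with c2, evaluated later
]
auto_choice = select_commit(history, "auto_best", selection_split="validation")
submit_choice = select_commit(history, "submit", submitted="c4")
```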

## Inspecting a run

```bash
uvx harbor view <jobs-dir> # browse trials
cat <jobs-dir>/*/*/verifier/reward.json
```
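
For more than a handful of trials, the same `*/*/verifier/reward.json` glob can be read from Python. A small sketch, assuming that layout holds for your jobs directory; `collect_rewards` is a hypothetical helper, not part of vero or Harbor, and the demo builds a throwaway directory rather than reading a real run.

```python
import json
import tempfile
from pathlib import Path


def collect_rewards(jobs_dir):
    """Map 'job/trial' to the parsed reward.json found under jobs_dir."""
    rewards = {}
    for path in sorted(Path(jobs_dir).glob("*/*/verifier/reward.json")):
        trial_dir = path.parent.parent
        key = f"{trial_dir.parent.name}/{trial_dir.name}"
        rewards[key] = json.loads(path.read_text())
    return rewards


# Demo on a temporary directory that mimics the layout used by `cat` above.
with tempfile.TemporaryDirectory() as tmp:
    verifier_dir = Path(tmp) / "job-1" / "trial-1" / "verifier"
    verifier_dir.mkdir(parents=True)
    (verifier_dir / "reward.json").write_text(json.dumps({"reward": 0.42}))
    demo_rewards = collect_rewards(tmp)
```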

## Examples

- [`examples/gsm8k-agent`](../../examples/gsm8k-agent) — Mode A (vero scores gsm8k).
- [`examples/gaia-optimization`](../../examples/gaia-optimization) — Mode B (terminus on
GAIA via nested Harbor on Modal), with an editable-prompt optimization surface.
79 changes: 79 additions & 0 deletions vero/examples/gaia-optimization/README.md
@@ -0,0 +1,79 @@
# GAIA optimization example (Harbor Mode B)

This example shows the **vero ⇄ Harbor** integration optimizing a coding agent on a
real benchmark. An optimizer (e.g. Claude Code) edits a GAIA agent's prompt; each
candidate is scored by a **nested `harbor run`** of the agent on real
[GAIA](https://huggingface.co/datasets/gaia-benchmark/GAIA) tasks (on Modal). The
reward is accuracy on a hidden split.

This is "Mode B": vero does **no** inference itself — evaluation is delegated to a
nested Harbor run, and the reward comes from Harbor's verifier. (Contrast "Mode A",
e.g. [`../gsm8k-agent`](../gsm8k-agent), where vero runs inference and scoring directly.)

## What's here

```
gaia-optimization/
├── build.yaml          # the optimization task definition (vero harbor build -c)
├── pyproject.toml      # deps: harbor[modal]
└── src/gaia_agent/
    ├── agent.py        # GaiaAgent(Terminus2): the editable agent
    └── prompts/        # the OPTIMIZATION SURFACE — the optimizer edits these
        ├── terminus-json-plain.txt
        └── terminus-xml-plain.txt
```

`GaiaAgent` subclasses Harbor's `Terminus2` and overrides only its prompt-template
path so the prompt is read from this package's editable `prompts/` directory. The
optimizer improves `prompts/terminus-json-plain.txt`; the terminal loop, tmux
session, and response parsing are reused from `Terminus2` unchanged.

## Prerequisites

- The `harbor` CLI (`uvx --from 'harbor[modal]' harbor ...`) and Docker (for the outer trial).
- A [Modal](https://modal.com) account for the inner GAIA runs:
`MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` in your shell env.
- An OpenAI-compatible LLM endpoint for the **inner** GAIA agent:
`OPENAI_API_KEY` (+ optional `OPENAI_BASE_URL` to point at a gateway). The model is
set in `build.yaml` (`harbor.model`, default `openai/gpt-4o-mini`).
- Creds for the **outer** optimizer agent, per that agent (e.g. `ANTHROPIC_API_KEY`
for `-a claude-code`). Harbor forwards these from your shell into the optimizer's
container; they are **not** shared with the eval sidecar.

Secrets are resolved from your shell at run time and injected into the eval sidecar
**only** (see `build.yaml`'s `secrets:` — those are variable *names*, not values).

## Run it

```bash
# install vero with the harbor extra
uv pip install 'scale-vero[harbor]'

# build the task, then run it with an optimizer of your choice
vero harbor build -c build.yaml -o /tmp/gaia-task
uvx harbor run -p /tmp/gaia-task -a claude-code -m claude-haiku-4-5 -e docker

# ...or build + run in one step:
vero harbor run -c build.yaml -a claude-code -m claude-haiku-4-5 -e docker
```

The optimizer reads the task instruction, edits `src/gaia_agent/prompts/...`, commits,
and calls `vero harbor eval --split train` to measure candidates within its budget.
At the end, the best train commit is scored on the hidden `validation` split and the
accuracy is written to Harbor's `reward.json`.

## Notes

- **GAIA is hard.** A terminal agent solves only some tasks; expect low scores and
weak optimization signal on a 5-task subset. Increase the subset, pick easier tasks,
or use a stronger model for a more meaningful run.
- **Cost/time.** Each GAIA task is a full agent rollout on a Modal sandbox (minutes +
LLM tokens). The default budget keeps a run to a handful of nested evals.
- Pick your own task ids by enumerating the benchmark:
`python -c "import asyncio; from harbor.models.job.config import DatasetConfig as D; print(asyncio.run(D(name='gaia/gaia').get_task_configs()))"`
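
Once you have a list of task names, splitting it into the `partition` mapping is mechanical. A sketch, assuming you want a reproducible random split: `make_partition` is a hypothetical helper, and the task names in the demo are made-up placeholders rather than real GAIA ids.

```python
import random


def make_partition(task_names, n_train, n_validation, seed=0):
    """Split task names into disjoint train/validation lists, reproducibly."""
    names = sorted(set(task_names))  # dedupe and fix the order before shuffling
    if n_train + n_validation > len(names):
        raise ValueError("not enough tasks for the requested split sizes")
    random.Random(seed).shuffle(names)  # local RNG: global random state untouched
    return {
        "train": names[:n_train],
        "validation": names[n_train:n_train + n_validation],
    }


# Placeholder names only; substitute the ids printed by the one-liner above.
placeholder_tasks = [f"gaia/task-{i}" for i in range(12)]
partition = make_partition(placeholder_tasks, n_train=5, n_validation=5, seed=7)
```

Keeping the two lists disjoint matters here because the hidden split is only meaningful if the optimizer never evaluates on its tasks.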

## Attribution

`src/gaia_agent/prompts/*.txt` are copied from Harbor's `terminus_2` agent
(© Harbor authors, Apache-2.0) so the prompt stays compatible with the parser
`GaiaAgent` inherits. They are included here as the editable optimization surface.