diff --git a/vero/.gitignore b/vero/.gitignore index d8d3a3c..14d7b2c 100644 --- a/vero/.gitignore +++ b/vero/.gitignore @@ -11,6 +11,8 @@ __pycache__/ *.egg-info/ dist/ build/ +# ...but the harbor compiler package is source, not a packaging artifact: +!src/vero/harbor/build/ # Testing .pytest_cache/ diff --git a/vero/README.md b/vero/README.md index cddac72..eca1ff5 100644 --- a/vero/README.md +++ b/vero/README.md @@ -525,6 +525,22 @@ agent = VeroAgent( ) ``` +## Harbor integration + +vero can compile an optimization run into a [Harbor](https://www.harborframework.com) task, so the *optimizer* itself becomes a Harbor agent-under-test: any Harbor agent (Claude Code, an oracle script, …) edits a target repo and spends an evaluation budget, and the reward is the best candidate's score on a hidden split. This makes optimization runs reproducible and leaderboard-gradeable — the optimizer can't read hidden labels, modify the scorer, or bypass its budget. + +```bash +uv pip install 'scale-vero[harbor]' +vero harbor build -c build.yaml -o /tmp/opt-task # compile a Harbor task +vero harbor run -c build.yaml -a claude-code -m claude-haiku-4-5 -e docker # build + run +``` + +Two evaluation modes: **Mode A** (vero runs inference + scoring against vero-side labels) and **Mode B** (evaluation is delegated to a *nested* `harbor run`, e.g. on Modal). See: + +- [`docs/harbor/architecture.md`](docs/harbor/architecture.md) — what it is, the topology, and the leaderboard-integrity model. +- [`docs/harbor/tutorial.md`](docs/harbor/tutorial.md) — build and run a task end to end. +- [`examples/gsm8k-agent`](examples/gsm8k-agent) (Mode A) and [`examples/gaia-optimization`](examples/gaia-optimization) (Mode B). + ## Examples See [`examples/matmul-kernel/`](examples/matmul-kernel/) for a complete runnable example that optimizes a matrix multiply kernel for speed. 
It demonstrates eval-only mode, full optimization with VeroAgent or Claude Code, filesystem artifacts, and resource-based editing. diff --git a/vero/docs/harbor/architecture.md b/vero/docs/harbor/architecture.md new file mode 100644 index 0000000..bde10f8 --- /dev/null +++ b/vero/docs/harbor/architecture.md @@ -0,0 +1,121 @@ +# Harbor integration — architecture + +The Harbor integration turns a **vero optimization run into a [Harbor](https://www.harborframework.com) +task**. The agent-under-test of that Harbor task is an *optimizer*: any Harbor agent +(Claude Code, an oracle script, …) edits a target repository and spends an evaluation +budget; the reward is the best candidate's score on a hidden test split. + +This lets anyone optimize a coding agent with plain `harbor run`, and makes the result +leaderboard-gradeable — the optimizer cannot read hidden labels, modify the scorer, or +bypass its budget. + +``` +harbor run -p -a -m -e + │ + ▼ one optimization trial (a Docker Compose environment): + ┌────────────────────────┐ ┌────────────────────────────────────┐ + │ main (optimizer bench) │ HTTP │ eval-sidecar (the evaluation engine) │ + │ • target repo (rw*) │ ─────► │ • dataset + scorer + baseline repo │ + │ • `vero harbor` client │ │ • budget ledger + creds │ + │ • runs the -a optimizer│ │ • `vero harbor serve` (FastAPI) │ + └────────────────────────┘ └────────────────────────────────────┘ + │ (trial end, shared verifier) ▲ + └── `vero harbor finalize` (admin token) ──┘ → /logs/verifier/reward.json +``` + +## The optimization loop + +1. **`vero harbor build`** compiles a `build.yaml` into a Harbor task directory + (`environment/` compose + Dockerfiles, `instruction.md`, `tests/test.sh`), baking + the dataset, scorer, baseline repo, and a `ServeConfig`. +2. At trial start, **`main`** seeds the target repo onto a shared volume and applies + write-access rules; the **eval-sidecar** starts `vero harbor serve` and writes a + per-trial admin token. +3. 
The **optimizer** (the `-a` agent) edits the repo, commits, and calls + `vero harbor eval --split ` to measure a commit. The sidecar + fetches that commit, evaluates it (metered against the budget), and returns an + **aggregate** score (never per-sample labels). +4. At trial end, Harbor runs `tests/test.sh` in `main` (shared verifier mode). It + reads the admin token and calls the sidecar's **`finalize`**: the sidecar selects + the winning commit and scores it on the **hidden** test split, producing the reward. + +## Two evaluation modes + +The seam is a single injection point on the `Evaluator` (`eval_strategy`): + +- **Mode A — vero scores** (`task_project`/`task` + dataset). vero runs the agent's + inference and a vero scoring function against vero-side labels. Example: + [`examples/gsm8k-agent`](../../examples/gsm8k-agent). +- **Mode B — Harbor scores** (`HarborConfig`). Inference is delegated: for each + candidate, `HarborRunner` runs a *nested* `harbor run` of the agent on a set of + Harbor tasks (e.g. on Modal) and collates the verifier rewards. One Harbor task = + one sample. Example: [`examples/gaia-optimization`](../../examples/gaia-optimization). + +Both modes share the same topology, trust boundary, budget, and verifier — only the +"produce sample results" step differs. + +## Leaderboard integrity (the trust boundary) + +The optimizer is untrusted. Integrity rests on a few mechanisms, all best-effort at +the OS/process level (a container escape is out of scope): + +- **3-tier split visibility** (`SplitAccessLevel`): `visible` (aggregate + per-sample + results), `non_viewable` (aggregate score only — no labels), `no_access` (hidden; + never evaluable by the agent, never written to its volume). +- **Write-routing by tier**: the sidecar writes only the agent-permitted projection of + each result to the *agent-results* volume (read-only in `main`). Full results, the + dataset, the ledger, and creds live on the *admin* volume, **never** mounted to `main`. 
+- **Token-gated finalize**: `finalize` (selection + hidden-split scoring) requires an + admin token written `root:600` on a volume `main` mounts read-only. The optimizer + runs as a de-privileged user and cannot read it, so it cannot trigger scoring or + probe the test split; the verifier (root, shared mode) can. +- **Metered budget**: a durable `BudgetLedger` caps how much the agent can evaluate per + split. Admin (verifier) evaluations bypass the meter. +- **Commit transfer**: the sidecar `git fetch`es the agent's commit from the mounted + repo into its *own* repo with hooks disabled and `file://` (object copy, no + alternates), so the evaluated tree is fully owned by the sidecar and tamper-evident. +- **Protected scorer / write-access**: the scorer is sidecar-only; `read_only_paths` + in `build.yaml` are applied as unix perms in `main` before the optimizer runs. + +### Why a sidecar + shared verifier + +The evaluation engine, dataset, scorer, and creds live in a separate container so the +optimizer never shares a filesystem or process space with them. We use Harbor's +**shared verifier** (the env, including the sidecar, stays up during `tests/test.sh`) +so the verifier can reach the live engine over HTTP and stay the single source of +truth — avoiding shipping the repo/dataset/ledger into a fresh verifier container. The +agent/admin split is enforced by the `root:600` token rather than separate services. 
+ +## Component map + +``` +vero/harbor/ +├── build/ `vero harbor build`: BuildConfig → Harbor task dir +│ ├── config.py BuildConfig (the build.yaml schema) +│ ├── compiler.py renders the task dir; bakes dataset/scorer/repo/ServeConfig +│ └── templates/ compose, two Dockerfiles, instruction.md, test.sh, seed.sh, solve.sh +├── serve.py `vero harbor serve`: assemble engine+sidecar+verifier from ServeConfig +├── app.py FastAPI surface: /eval /submit /status (agent), /finalize (admin) +├── server.py EvaluationSidecar: commit transfer + tier write-routing (transport-agnostic) +├── verifier.py Verifier: commit selection (submit | auto_best) + hidden-split scoring +├── auth.py per-trial admin token (generate / root:600 write / verify) +├── cli.py `vero harbor` group: build | run | serve | eval | submit | status | finalize +├── config.py HarborConfig (Mode B) +├── runner.py HarborRunner (Mode-B EvalStrategy): nested `harbor run` → collate +├── dataset.py Mode-B {split: [task_names]} partition → DatasetDict +└── protocol.py aggregate-safe wire types + the redaction of an Experiment + +vero/evaluation/ +├── engine.py EvaluationEngine: budget metering + the single evaluate() entry point +├── evaluator.py Evaluator: checkout + run; the eval_strategy seam (Mode A vs B) +└── strategy.py EvalStrategy protocol +``` + +The compiler↔sidecar contract is `ServeConfig` (baked as `environment/sidecar/serve.json`); +the optimizer↔sidecar contract is the HTTP API in `app.py` (+ the `vero harbor` CLI clients). + +## See also + +- [Tutorial](./tutorial.md) — build and run an optimization task end to end. +- [`examples/gsm8k-agent`](../../examples/gsm8k-agent) — Mode A. +- [`examples/gaia-optimization`](../../examples/gaia-optimization) — Mode B (nested Harbor on Modal). 
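The tier-based write-routing described in `architecture.md` can be illustrated with a minimal sketch. It is not the sidecar's actual code: only the three tier names (`visible`, `non_viewable`, `no_access`) come from the document, while the `SplitAccessLevel` definition below, the `project_for_agent` helper, and the shape of the result dictionary are assumptions made for illustration.

```python
from __future__ import annotations

from enum import Enum
from typing import Any


class SplitAccessLevel(str, Enum):
    """Hypothetical stand-in for vero's enum; only the tier names come from the doc."""

    VISIBLE = "visible"
    NON_VIEWABLE = "non_viewable"
    NO_ACCESS = "no_access"


def project_for_agent(
    result: dict[str, Any], access: SplitAccessLevel
) -> dict[str, Any] | None:
    """Return the part of `result` the optimizer may see, or None if nothing may be written."""
    if access is SplitAccessLevel.NO_ACCESS:
        return None  # hidden split: nothing reaches the agent-results volume
    view = {"split": result["split"], "score": result["score"]}  # aggregate only
    if access is SplitAccessLevel.VISIBLE:
        view["samples"] = result["samples"]  # per-sample results for the visible tier only
    return view


# Assumed result shape: an aggregate score plus per-sample rows that carry labels.
full = {
    "split": "validation",
    "score": 0.6,
    "samples": [{"id": 1, "label": "42", "correct": True}],
}
print(project_for_agent(full, SplitAccessLevel.NON_VIEWABLE))
```

Building the agent view from an allowlist of fields, rather than deleting sensitive keys from the full result, means a field added to the full result later stays hidden by default.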
diff --git a/vero/docs/harbor/tutorial.md b/vero/docs/harbor/tutorial.md new file mode 100644 index 0000000..2cb95a6 --- /dev/null +++ b/vero/docs/harbor/tutorial.md @@ -0,0 +1,134 @@ +# Harbor integration — tutorial + +This walks through compiling a vero optimization run into a Harbor task and running it +with an optimizer agent. Read the [architecture](./architecture.md) first for the +concepts (modes, the trust boundary, the optimization loop). + +## Install + +```bash +uv pip install 'scale-vero[harbor]' # adds the `vero harbor` CLI +# the Harbor CLI itself is invoked via uvx; for Modal-backed inner runs use the extra: +uvx --from 'harbor[modal]' harbor --help +``` + +## 1. Write a `build.yaml` + +A build config describes the optimization task: the repo to optimize, how candidates +are scored, the split tiers, the budget, and the reward. + +### Mode A — vero runs inference + scoring + +```yaml +name: myorg/gsm8k-opt +agent_repo: /path/to/gsm8k-agent # the repo the optimizer edits +mode: A +task: gsm8k # vero task name +task_module: gsm8k_agent.vero_tasks # module that registers it +dataset: /path/to/gsm8k-dataset # a saved DatasetDict (inputs + labels) + +splits: + - { split: validation, access: non_viewable } # optimizer sees aggregate score only + - { split: test, access: no_access } # hidden; scored at the end +budgets: + - { split: validation, total_run_budget: 5 } +reward_mode: auto_best # best validation commit auto-selected +selection_split: validation +targets: + - { split: test, reward_key: reward } +read_only_paths: + - src/gsm8k_agent/vero_tasks # the scorer — optimizer may not edit it +secrets: [OPENAI_API_KEY, OPENAI_BASE_URL] # injected into the eval sidecar only +``` + +### Mode B — a nested `harbor run` scores (e.g. 
on Modal) + +```yaml +name: myorg/gaia-opt +agent_repo: /path/to/gaia-agent +mode: B +harbor: + agent_import_path: "gaia_agent:GaiaAgent" # the agent inside agent_repo + task_source: gaia/gaia # Harbor registry benchmark (or a local dir) + environment: modal + model: openai/gpt-4o-mini # the inner agent's model +partition: # {split: [harbor task names]} — one task = one sample + train: [gaia/, gaia/, ...] + validation: [gaia/, gaia/, ...] +splits: + - { split: train, access: non_viewable } + - { split: validation, access: no_access } +budgets: + - { split: train, total_run_budget: 3 } +reward_mode: auto_best +selection_split: train +targets: + - { split: validation, reward_key: accuracy } +secrets: [OPENAI_API_KEY, OPENAI_BASE_URL, MODAL_TOKEN_ID, MODAL_TOKEN_SECRET] +``` + +`secrets` are variable **names**: their values are read from your shell at run time and +injected into the eval sidecar only — never into the optimizer's container. The full +field list is in `vero/harbor/build/config.py` (`BuildConfig`). + +## 2. Build the task + +```bash +vero harbor build -c build.yaml -o /tmp/opt-task +``` + +This emits a Harbor task directory: `environment/` (a Docker Compose env = the optimizer +workbench `main` + the `eval-sidecar`, plus volumes), `instruction.md` (the protocol the +optimizer reads), and `tests/test.sh` (the verifier). The dataset/scorer/baseline repo +and the sidecar's `ServeConfig` are baked in. + +## 3. Run it with an optimizer + +Any Harbor agent can be the optimizer. Provide its creds in your shell (Harbor forwards +them into `main`); e.g. for `claude-code` set `ANTHROPIC_API_KEY` (+ `ANTHROPIC_BASE_URL` +if routing through a gateway). 
+ +```bash +# build + run in one step: +vero harbor run -c build.yaml -a claude-code -m claude-haiku-4-5 -e docker + +# or run a pre-built task dir: +uvx harbor run -p /tmp/opt-task -a claude-code -m claude-haiku-4-5 -e docker + +# the `oracle` agent runs solution/solve.sh (a scripted optimizer) — handy for a smoke test: +uvx harbor run -p /tmp/opt-task -a oracle -e docker +``` + +The reward lands in the job's `verifier/reward.json` (e.g. `{"reward": 0.42}`), and Harbor +reports it as the trial reward. + +## What the optimizer does (the agent-side protocol) + +Inside `main`, the optimizer follows `instruction.md`. The `vero harbor` CLI talks to the +eval sidecar over `VERO_EVAL_URL` (set automatically): + +```bash +vero harbor status # remaining budget, evaluable splits +# edit the repo, commit, then measure the current HEAD: +vero harbor eval --dataset-id --split validation +vero harbor submit # (if reward_mode: submit) nominate the final commit +``` + +- `eval` returns an aggregate score + remaining budget; for `no_access` splits it is + rejected, and labels are never returned. +- With `reward_mode: auto_best`, the best commit on `selection_split` is chosen + automatically; with `submit`, the agent nominates one. +- The verifier scores the chosen commit on the hidden `targets` split at the end. + +## Inspecting a run + +```bash +uvx harbor view # browse trials +cat /*/*/verifier/reward.json +``` + +## Examples + +- [`examples/gsm8k-agent`](../../examples/gsm8k-agent) — Mode A (vero scores gsm8k). +- [`examples/gaia-optimization`](../../examples/gaia-optimization) — Mode B (terminus on + GAIA via nested Harbor on Modal), with an editable-prompt optimization surface. 
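The two `reward_mode` values described in the tutorial (`auto_best` and `submit`) can be sketched as a small selection function. This is a sketch under stated assumptions, not the `Verifier` implementation: the `select_commit` name, the `(commit, score)` history format, and the first-wins tie-breaking are all hypothetical.

```python
from __future__ import annotations


def select_commit(
    evals: list[tuple[str, float]],
    reward_mode: str,
    submitted: str | None = None,
) -> str:
    """Pick the commit that would be scored on the hidden target split.

    `evals` holds (commit, aggregate score on selection_split) pairs in
    evaluation order; this format is an assumption for the sketch.
    """
    if reward_mode == "submit":
        if submitted is None:
            raise ValueError("reward_mode 'submit' needs a nominated commit")
        return submitted
    if reward_mode == "auto_best":
        if not evals:
            raise ValueError("no evaluated commits to choose from")
        # max() keeps the first maximal element, so ties go to the earliest eval.
        return max(evals, key=lambda pair: pair[1])[0]
    raise ValueError(f"unknown reward_mode: {reward_mode!r}")


history = [("a1b2c3", 0.40), ("d4e5f6", 0.55), ("0a9b8c", 0.50)]
print(select_commit(history, "auto_best"))  # prints d4e5f6
```

With `reward_mode: submit` the evaluation history is ignored and the nominated commit is returned as-is, which matches the tutorial's description that the agent nominates the final commit itself.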
diff --git a/vero/examples/gaia-optimization/README.md b/vero/examples/gaia-optimization/README.md new file mode 100644 index 0000000..2d2450f --- /dev/null +++ b/vero/examples/gaia-optimization/README.md @@ -0,0 +1,79 @@ +# GAIA optimization example (Harbor Mode B) + +This example shows the **vero ⇄ Harbor** integration optimizing a coding agent on a +real benchmark. An optimizer (e.g. Claude Code) edits a GAIA agent's prompt; each +candidate is scored by a **nested `harbor run`** of the agent on real +[GAIA](https://huggingface.co/datasets/gaia-benchmark/GAIA) tasks (on Modal). The +reward is accuracy on a hidden split. + +This is "Mode B": vero does **no** inference itself — evaluation is delegated to a +nested Harbor run, and the reward comes from Harbor's verifier. (Contrast "Mode A", +e.g. [`../gsm8k-agent`](../gsm8k-agent), where vero runs inference and scoring directly.) + +## What's here + +``` +gaia-optimization/ +├── build.yaml # the optimization task definition (vero harbor build -c) +├── pyproject.toml # deps: harbor[modal] +└── src/gaia_agent/ + ├── agent.py # GaiaAgent(Terminus2): the editable agent + └── prompts/ # the OPTIMIZATION SURFACE — the optimizer edits these + ├── terminus-json-plain.txt + └── terminus-xml-plain.txt +``` + +`GaiaAgent` subclasses Harbor's `Terminus2` and overrides only its prompt-template +path so the prompt is read from this package's editable `prompts/` directory. The +optimizer improves `prompts/terminus-json-plain.txt`; the terminal loop, tmux +session, and response parsing are reused from `Terminus2` unchanged. + +## Prerequisites + +- The `harbor` CLI (`uvx --from 'harbor[modal]' harbor ...`) and Docker (outer trial). +- A [Modal](https://modal.com) account for the inner GAIA runs: + `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET` in your shell env. +- An OpenAI-compatible LLM endpoint for the **inner** GAIA agent: + `OPENAI_API_KEY` (+ optional `OPENAI_BASE_URL` to point at a gateway). 
The model is + set in `build.yaml` (`harbor.model`, default `openai/gpt-4o-mini`). +- Creds for the **outer** optimizer agent, per that agent (e.g. `ANTHROPIC_API_KEY` + for `-a claude-code`). Harbor forwards these from your shell into the optimizer's + container; they are **not** shared with the eval sidecar. + +Secrets are resolved from your shell at run time and injected into the eval sidecar +**only** (see `build.yaml`'s `secrets:` — those are variable *names*, not values). + +## Run it + +```bash +# install vero with the harbor extra +uv pip install 'scale-vero[harbor]' + +# build the task, then run it with an optimizer of your choice +vero harbor build -c build.yaml -o /tmp/gaia-task +uvx harbor run -p /tmp/gaia-task -a claude-code -m claude-haiku-4-5 -e docker + +# ...or build + run in one step: +vero harbor run -c build.yaml -a claude-code -m claude-haiku-4-5 -e docker +``` + +The optimizer reads the task instruction, edits `src/gaia_agent/prompts/...`, commits, +and calls `vero harbor eval --split train` to measure candidates within its budget. +At the end, the best train commit is scored on the hidden `validation` split and the +accuracy is written to Harbor's `reward.json`. + +## Notes + +- **GAIA is hard.** A terminal agent solves only some tasks; expect low scores and + weak optimization signal on a 5-task subset. Increase the subset, pick easier tasks, + or use a stronger model for a more meaningful run. +- **Cost/time.** Each GAIA task is a full agent rollout on a Modal sandbox (minutes + + LLM tokens). The default budget keeps a run to a handful of nested evals. 
+- Pick your own task ids by enumerating the benchmark: + `python -c "import asyncio; from harbor.models.job.config import DatasetConfig as D; print(asyncio.run(D(name='gaia/gaia').get_task_configs()))"` + +## Attribution + +`src/gaia_agent/prompts/*.txt` are copied from Harbor's `terminus_2` agent +(© Harbor authors, Apache-2.0) so the prompt stays compatible with the parser +`GaiaAgent` inherits. They are included here as the editable optimization surface. diff --git a/vero/examples/gaia-optimization/build.yaml b/vero/examples/gaia-optimization/build.yaml new file mode 100644 index 0000000..4398e3e --- /dev/null +++ b/vero/examples/gaia-optimization/build.yaml @@ -0,0 +1,63 @@ +# `vero harbor build -c build.yaml -o ` compiles this into a Harbor task. +# Then: `harbor run -p -a -m -e docker` +# (or use `vero harbor run -c build.yaml -a ...` to build + run in one step). +# +# Mode B: the optimizer edits the GaiaAgent prompt in this repo; the eval sidecar +# scores candidates by a *nested* `harbor run` of the agent on real GAIA tasks +# (here on Modal). Reward = accuracy on the hidden `validation` split. + +name: examples/gaia-optimization +description: Optimize a terminus-2 GAIA agent's prompt; reward = accuracy on hidden GAIA tasks. + +agent_repo: . # this directory (the GaiaAgent the optimizer edits) +mode: B + +harbor: + agent_import_path: "gaia_agent:GaiaAgent" + task_source: gaia/gaia # Harbor registry benchmark (enumerated below) + environment: modal # inner provider; needs MODAL_TOKEN_ID/SECRET (see README) + # The inner GAIA agent's model. `openai/` routes via OPENAI_BASE_URL, so this + # works against OpenAI directly or any OpenAI-compatible gateway you point it at. + model: openai/gpt-4o-mini + max_retries: 1 + +# A small subset of gaia/gaia: 5 tasks the optimizer may measure on (train) and +# 5 held-out tasks scored once at the end (validation). 
Swap in your own ids; +# list them with: DatasetConfig(name="gaia/gaia").get_task_configs() +partition: + train: + - gaia/00d579ea-0889-4fd9-a771-2c8d79835c8d + - gaia/023e9d44-96ae-4eed-b912-244ee8c3b994 + - gaia/0383a3ee-47a7-41a4-b493-519bdefe0488 + - gaia/04a04a9b-226c-43fd-b319-d5e89743676f + - gaia/0512426f-4d28-49f0-be77-06d05daec096 + validation: + - gaia/05407167-39ec-4d3a-a234-73a9120c325d + - gaia/076c8171-9b3b-49b9-a477-244d2a532826 + - gaia/08c0b6e9-1b43-4c2e-ae55-4e3fce2c2715 + - gaia/08cae58d-4084-4616-b6dd-dd6534e4825b + - gaia/08f3a05f-5947-4089-a4c4-d4bcfaa6b7a0 + +splits: + - { split: train, access: non_viewable } # optimizer sees aggregate scores only + - { split: validation, access: no_access } # hidden; never reaches the optimizer + +budgets: + - { split: train, total_run_budget: 3 } # up to 3 measured evals of the train split + +reward_mode: auto_best # best train commit is auto-selected +selection_split: train +targets: + - { split: validation, reward_key: accuracy } + +# Secrets are resolved from your shell env and injected into the eval sidecar only, +# never into the optimizer's container. These are variable NAMES, not values. +secrets: + - OPENAI_API_KEY + - OPENAI_BASE_URL + - MODAL_TOKEN_ID + - MODAL_TOKEN_SECRET + +timeout: 3600 +sample_timeout: 900 +max_concurrency: 5 diff --git a/vero/examples/gaia-optimization/pyproject.toml b/vero/examples/gaia-optimization/pyproject.toml new file mode 100644 index 0000000..ad04d9a --- /dev/null +++ b/vero/examples/gaia-optimization/pyproject.toml @@ -0,0 +1,18 @@ +[project] +name = "gaia-agent" +version = "0.1.0" +description = "A GAIA agent (terminus-2 + editable prompt) optimized via the vero Harbor integration." 
+requires-python = ">=3.12" # harbor[modal] requires Python 3.12+ +dependencies = [ + "harbor[modal]>=0.13", +] + +[build-system] +requires = ["hatchling"] +build-backend = "hatchling.build" + +[tool.hatch.build.targets.wheel] +packages = ["src/gaia_agent"] + +[tool.hatch.build.targets.wheel.force-include] +"src/gaia_agent/prompts" = "gaia_agent/prompts" diff --git a/vero/examples/gaia-optimization/src/gaia_agent/__init__.py b/vero/examples/gaia-optimization/src/gaia_agent/__init__.py new file mode 100644 index 0000000..1da8a40 --- /dev/null +++ b/vero/examples/gaia-optimization/src/gaia_agent/__init__.py @@ -0,0 +1,3 @@ +from gaia_agent.agent import GaiaAgent + +__all__ = ["GaiaAgent"] diff --git a/vero/examples/gaia-optimization/src/gaia_agent/agent.py b/vero/examples/gaia-optimization/src/gaia_agent/agent.py new file mode 100644 index 0000000..205191c --- /dev/null +++ b/vero/examples/gaia-optimization/src/gaia_agent/agent.py @@ -0,0 +1,38 @@ +"""A GAIA optimization target: Harbor's terminus-2 with an editable prompt. + +``GaiaAgent`` subclasses Harbor's ``Terminus2`` and points its prompt template at +this package's ``prompts/`` directory instead of the copy baked into the harbor +package. That makes the prompt the *optimization surface*: an optimizer (e.g. +Claude Code, driving ``vero harbor eval``) edits ``prompts/terminus-json-plain.txt`` +to improve the agent's GAIA score, while the terminal loop, tmux session, and +response parsing are reused unchanged from ``Terminus2``. + +The agent runs in the Harbor orchestrator process (where the LLM creds live) and +drives the task sandbox via ``environment.exec``; see the example README. 
+""" + +from __future__ import annotations + +from pathlib import Path + +from harbor.agents.terminus_2.terminus_2 import Terminus2 + +_PROMPTS = Path(__file__).parent / "prompts" + + +class GaiaAgent(Terminus2): + """Terminus-2 with its prompt sourced from this package's editable ``prompts/``.""" + + @staticmethod + def name() -> str: + return "gaia-agent" + + def version(self) -> str: + return "0.1.0" + + def _get_prompt_template_path(self) -> Path: + if self._parser_name == "json": + return _PROMPTS / "terminus-json-plain.txt" + if self._parser_name == "xml": + return _PROMPTS / "terminus-xml-plain.txt" + return super()._get_prompt_template_path() diff --git a/vero/examples/gaia-optimization/src/gaia_agent/prompts/terminus-json-plain.txt b/vero/examples/gaia-optimization/src/gaia_agent/prompts/terminus-json-plain.txt new file mode 100644 index 0000000..6481d56 --- /dev/null +++ b/vero/examples/gaia-optimization/src/gaia_agent/prompts/terminus-json-plain.txt @@ -0,0 +1,54 @@ +You are an AI assistant tasked with solving command-line tasks in a Linux environment. You will be given a task description and the output from previously executed commands. Your goal is to solve the task by providing batches of shell commands. + +Format your response as JSON with the following structure: + +{{ + "analysis": "Analyze the current state based on the terminal output provided. What do you see? What has been accomplished? What still needs to be done?", + "plan": "Describe your plan for the next steps. What commands will you run and why? 
Be specific about what you expect each command to accomplish.", + "commands": [ + {{ + "keystrokes": "ls -la\n", + "duration": 0.1 + }}, + {{ + "keystrokes": "cd project\n", + "duration": 0.1 + }} + ], + "task_complete": true +}} + +Required fields: +- "analysis": Your analysis of the current situation +- "plan": Your plan for the next steps +- "commands": Array of command objects to execute + +Optional fields: +- "task_complete": Boolean indicating if the task is complete (defaults to false if not present) + +Command object structure: +- "keystrokes": String containing the exact keystrokes to send to the terminal (required) +- "duration": Number of seconds to wait for the command to complete before the next command will be executed (defaults to 1.0 if not present) + +IMPORTANT: The text inside "keystrokes" will be used completely verbatim as keystrokes. Write commands exactly as you want them sent to the terminal: +- You must end every command with a newline (\n) or it will not execute. +- For special key sequences, use tmux-style escape sequences: + - C-c for Ctrl+C + - C-d for Ctrl+D + +The "duration" attribute specifies the number of seconds to wait for the command to complete (default: 1.0) before the next command will be executed. On immediate tasks (e.g., cd, ls, echo, cat) set a duration of 0.1 seconds. On commands (e.g., gcc, find, rustc) set a duration of 1.0 seconds. On slow commands (e.g., make, python3 [long running script], wget [file]) set an appropriate duration as you determine necessary. + +It is better to set a smaller duration than a longer duration. It is always possible to wait again if the prior output has not finished, by running {{"keystrokes": "", "duration": 10.0}} on subsequent requests to wait longer. Never wait longer than 60 seconds; prefer to poll to see intermediate result status. 
+ +Important notes: +- Each command's keystrokes are sent exactly as written to the terminal +- Do not include extra whitespace before or after the keystrokes unless it's part of the intended command +- Extra text before or after the JSON will generate warnings but be tolerated +- The JSON must be valid - use proper escaping for quotes and special characters within strings +- Commands array can be empty if you want to wait without taking action + +Task Description: +{instruction} + +Current terminal state: +{terminal_state} diff --git a/vero/examples/gaia-optimization/src/gaia_agent/prompts/terminus-xml-plain.txt b/vero/examples/gaia-optimization/src/gaia_agent/prompts/terminus-xml-plain.txt new file mode 100644 index 0000000..6386356 --- /dev/null +++ b/vero/examples/gaia-optimization/src/gaia_agent/prompts/terminus-xml-plain.txt @@ -0,0 +1,60 @@ +You are an AI assistant tasked with solving command-line tasks in a Linux environment. You will be given a task description and the output from previously executed commands. Your goal is to solve the task by providing batches of shell commands. + +Format your response as XML with the following structure: + +<response> +<analysis> +Analyze the current state based on the terminal output provided. What do you see? What has been accomplished? What still needs to be done? +</analysis> +<plan> +Describe your plan for the next steps. What commands will you run and why? Be specific about what you expect each command to accomplish. +</plan> +<commands> +<keystrokes duration="0.1">ls -la +</keystrokes> +<keystrokes duration="0.1">cd project +</keystrokes> +</commands> +<task_complete>true</task_complete> +</response> + +Required sections: +- <analysis>: Your analysis of the current situation +- <plan>: Your plan for the next steps +- <commands>: XML structure containing commands to execute + +The `duration` attribute of <keystrokes> specifies the number of seconds to wait for the command to complete (default: 1.0) before the next command will be executed. On immediate tasks (e.g., cd, ls, echo, cat) set a duration of 0.1 seconds. On commands (e.g., gcc, find, rustc) set a duration of 1.0 seconds.
On slow commands (e.g., make, python3 [long running script], wget [file]) set an appropriate duration as you determine necessary. + +It is better to set a smaller duration than a longer duration. It is always possible to wait again if the prior output has not finished, by running <keystrokes duration="10.0"></keystrokes> on subsequent requests to wait longer. Never wait longer than 60 seconds; prefer to poll to see intermediate result status. + +Optional sections: +- <task_complete>: Include this tag if the task is complete. Can be: + - <task_complete>true</task_complete> (task complete) + - <task_complete>false</task_complete> (task not complete) + - <task_complete/> (self-closing, equivalent to false) + - <task_complete></task_complete> (empty, equivalent to false) + - If not present, task is assumed not complete + +IMPORTANT: The text inside each <keystrokes> tag will be used completely verbatim as keystrokes. DO NOT XML-encode special characters - write them directly: +- Use < and > directly, NOT &lt; and &gt; +- Use & directly, NOT &amp; +- Use quotes directly, NOT &quot; +Even though this is XML, the content inside keystrokes tags is treated as raw text and sent exactly as written. Ensure there is no extra leading or trailing whitespace unless intended. You must end every command with a newline (\n) or it will not execute. + + +Special key sequences (use tmux-style escape sequences): +- C-c for Ctrl+C. MUST be sent as a keystroke by itself, e.g., <keystrokes>C-c</keystrokes> +- C-d for Ctrl+D.
MUST be sent as a keystroke by itself, e.g., <keystrokes>C-d</keystrokes> +- For Enter/newline: simply add a newline (line break) in the XML, everything inside the command tag will be sent byte-for-byte + +Important notes: +- Each command's text content is sent exactly as keystrokes to the terminal +- Do not include extra whitespace before or after the command text unless it's part of the intended command +- Avoid extra text before or after the <response> tags +- Avoid additional XML tags outside of analysis/plan/commands/task_complete + +Task Description: +{instruction} + +Current terminal state: +{terminal_state} diff --git a/vero/pyproject.toml b/vero/pyproject.toml index 5e182cc..6a88415 100644 --- a/vero/pyproject.toml +++ b/vero/pyproject.toml @@ -14,7 +14,9 @@ dependencies = [ "datasets>=4.3.0", "pydantic>=2.11.7", "python-dotenv>=1.2.2", + "pyyaml>=6.0", "requests>=2.32.5", + "rich>=13.9.4", "s3fs>=2025.9.0", "tenacity>=9.1.2", "toml>=0.10.2", @@ -37,6 +39,12 @@ docker = [ claude = [ "claude-agent-sdk>=0.1.56", ] +harbor = [ + "fastapi>=0.110", + "uvicorn>=0.27", + "httpx>=0.27", + "jinja2>=3.1.6", +] optimize = [ "async-lru>=2.0.5", "beautifulsoup4>=4.14.2", diff --git a/vero/src/vero/core/budget.py b/vero/src/vero/core/budget.py new file mode 100644 index 0000000..3d370f5 --- /dev/null +++ b/vero/src/vero/core/budget.py @@ -0,0 +1,187 @@ +"""Per-split evaluation budgets and the ledger that meters them. + +``SplitBudget`` is the public, stateful budget for one (split, dataset_id) pair. +``BudgetLedger`` owns a set of them; the keys also form the allowlist of +evaluable combinations. The ledger is in-memory by default (the in-process +``ExperimentRunnerTool``); with a ``persist_path`` it flushes every mutation to +durable JSON under a single-writer lock (the Harbor eval sidecar).
+""" + +from __future__ import annotations + +import asyncio +import json +import logging +from dataclasses import dataclass, field +from pathlib import Path + +from vero.exceptions import ExperimentBudgetExceeded, InvalidSplitError + +logger = logging.getLogger(__name__) + + +@dataclass +class SplitBudget: + """A stateful object that tracks the remaining budget for running experiments.""" + + split: str + dataset_id: str = "" + total_sample_budget: int | None = None + remaining_sample_budget: int | None = field(init=False) + total_run_budget: int | None = None + remaining_run_budget: int | None = field(init=False) + max_samples_per_run: int | None = None + + def __repr__(self) -> str: + repr_items = [ + ("split", self.split), + ("dataset_id", self.dataset_id), + ("total_sample_budget", self.total_sample_budget), + ("total_run_budget", self.total_run_budget), + ] + repr_items = [item for item in repr_items if item[1] is not None] + return ( + f"SplitBudget({', '.join([f'{item[0]}={item[1]}' for item in repr_items])})" + ) + + def __post_init__(self): + assert ( + self.total_sample_budget is not None or self.total_run_budget is not None + ), "Either total sample budget or total run budget must be provided." 
+ self.remaining_sample_budget = self.total_sample_budget + self.remaining_run_budget = self.total_run_budget + + assert ( + isinstance(self.total_sample_budget, int) + or self.total_sample_budget is None + ) + assert isinstance(self.total_run_budget, int) or self.total_run_budget is None + assert ( + isinstance(self.max_samples_per_run, int) + or self.max_samples_per_run is None + ) + + def has_run_budget(self) -> bool: + return self.remaining_run_budget is None or self.remaining_run_budget > 0 + + def decrement_run_budget(self) -> None: + if self.remaining_run_budget is not None: + self.remaining_run_budget -= 1 + + def has_sample_budget(self, num_samples: int) -> bool: + return ( + self.remaining_sample_budget is None + or self.remaining_sample_budget >= num_samples + ) + + def decrement_sample_budget(self, num_samples: int) -> None: + if self.remaining_sample_budget is not None: + self.remaining_sample_budget -= num_samples + + def exceeds_per_run_budget(self, num_samples: int) -> bool: + return ( + self.max_samples_per_run is not None + and num_samples > self.max_samples_per_run + ) + + +class BudgetLedger: + """Meters evaluation budget across (split, dataset_id) pairs. + + The keys are also the allowlist of evaluable combinations: a pair with no + budget entry is rejected by ``validate``. + + In-memory by default. Pass ``persist_path`` for the durable, crash-safe + variant used by the Harbor sidecar — every mutation is flushed under a + single-writer lock, and ``reserve`` checks-and-decrements atomically before a + run so concurrent callers cannot overspend. Budget is never refunded on error. 
+ """ + + def __init__( + self, + budgets: list[SplitBudget] | None = None, + *, + persist_path: Path | str | None = None, + ): + self._budgets: dict[tuple[str, str], SplitBudget] = { + (b.split, b.dataset_id): b for b in (budgets or []) + } + self.persist_path = Path(persist_path) if persist_path else None + self._lock = asyncio.Lock() + + def validate(self, dataset_id: str, split: str) -> None: + """Raise if (split, dataset_id) is not an allowed combination.""" + if (split, dataset_id) not in self._budgets: + allowed_keys = list(self._budgets.keys()) + raise InvalidSplitError( + f"No split budget found for the combination (dataset_id={dataset_id}, split={split}) " + f"either because it does not exist or because it is not allowed. " + f"Allowed combinations: {allowed_keys}" + ) + + def get(self, dataset_id: str, split: str) -> SplitBudget: + """Return the budget for a pair (validates membership first).""" + self.validate(dataset_id, split) + return self._budgets[(split, dataset_id)] + + def check(self, dataset_id: str, split: str, num_samples: int) -> None: + """Raise ExperimentBudgetExceeded if the request would exceed the budget.""" + budget = self.get(dataset_id, split) + if not budget.has_run_budget(): + raise ExperimentBudgetExceeded( + f"No runs left for the {split} split of the {dataset_id} dataset." + ) + if not budget.has_sample_budget(num_samples): + raise ExperimentBudgetExceeded( + f"Requested {num_samples} samples for the {split} split of the {dataset_id} dataset, " + f"but the remaining sample budget only allows for {budget.remaining_sample_budget} samples." + ) + if budget.exceeds_per_run_budget(num_samples): + raise ExperimentBudgetExceeded( + f"Requested {num_samples} samples for the {split} split of the {dataset_id} dataset, " + f"but only {budget.max_samples_per_run} are allowed per run." 
+ ) + + def record(self, dataset_id: str, split: str, num_samples: int) -> SplitBudget: + """Decrement the budget for a completed (or attempted) run and flush.""" + budget = self.get(dataset_id, split) + budget.decrement_sample_budget(num_samples) + budget.decrement_run_budget() + self._flush() + return budget + + async def reserve( + self, dataset_id: str, split: str, num_samples: int + ) -> SplitBudget: + """Atomically check + record before a run (durable, single-writer). + + Raises InvalidSplitError / ExperimentBudgetExceeded *before* decrementing, + so a rejected request costs nothing; a reserved request is never refunded. + """ + async with self._lock: + self.check(dataset_id, split, num_samples) + return self.record(dataset_id, split, num_samples) + + def status(self) -> dict[tuple[str, str], SplitBudget]: + """Return all budgets keyed by (split, dataset_id).""" + return dict(self._budgets) + + def _flush(self) -> None: + if self.persist_path is None: + return + data = [ + { + "split": b.split, + "dataset_id": b.dataset_id, + "total_sample_budget": b.total_sample_budget, + "remaining_sample_budget": b.remaining_sample_budget, + "total_run_budget": b.total_run_budget, + "remaining_run_budget": b.remaining_run_budget, + "max_samples_per_run": b.max_samples_per_run, + } + for b in self._budgets.values() + ] + self.persist_path.parent.mkdir(parents=True, exist_ok=True) + tmp = self.persist_path.with_suffix(self.persist_path.suffix + ".tmp") + tmp.write_text(json.dumps(data, indent=2)) + tmp.replace(self.persist_path) diff --git a/vero/src/vero/core/cli.py b/vero/src/vero/core/cli.py index 419ec80..ebc2293 100644 --- a/vero/src/vero/core/cli.py +++ b/vero/src/vero/core/cli.py @@ -83,6 +83,16 @@ def main(): setup_logging() +# Optional `vero harbor` group (requires the `harbor` extra). Registered only if the +# import succeeds, so the base CLI works without it.
+try: + from vero.harbor.cli import harbor as _harbor_group + + main.add_command(_harbor_group) +except ImportError: + pass + + @main.group() def init(): """Initialize evaluation scaffolds for your uv project.""" @@ -578,7 +588,7 @@ def check( if errors: click.echo("\n Skipping task discovery (project issues above)") else: - from vero.evaluator import Evaluator + from vero.evaluation.evaluator import Evaluator from vero.workspace.git import GitWorkspace async def _discover(): @@ -760,7 +770,7 @@ def evaluate( """Run an evaluation on an agent codebase.""" import asyncio - from vero.evaluator import run_evaluation + from vero.evaluation.evaluator import run_evaluation asyncio.run( run_evaluation( diff --git a/vero/src/vero/core/dataset/base.py b/vero/src/vero/core/dataset/base.py index d9cf3ef..0b5c0bf 100644 --- a/vero/src/vero/core/dataset/base.py +++ b/vero/src/vero/core/dataset/base.py @@ -19,10 +19,17 @@ class DefaultSplitNames(StrEnum): class SplitAccessLevel(StrEnum): - """Access levels for dataset splits.""" + """Access levels for dataset splits. + + Three tiers of increasing restriction: + - viewable: rows materialized + full per-sample results visible. + - non_viewable: no rows, but the split can be evaluated and summary stats seen. + - no_access: no rows, no summary, and not agent-evaluable (admin/verifier only). 
+ """ viewable = "viewable" non_viewable = "non_viewable" + no_access = "no_access" @dataclass @@ -40,17 +47,28 @@ def viewable(cls, split: str) -> SplitAccess: def non_viewable(cls, split: str) -> SplitAccess: return cls(split=split, access=SplitAccessLevel.non_viewable) + @classmethod + def no_access(cls, split: str) -> SplitAccess: + return cls(split=split, access=SplitAccessLevel.no_access) + default_split_accesses = ( - SplitAccess.non_viewable(DefaultSplitNames.test), + SplitAccess.no_access(DefaultSplitNames.test), SplitAccess.non_viewable(DefaultSplitNames.validation), ) def get_non_viewable_splits(split_accesses: list[SplitAccess]) -> list[str]: - """Extract non-viewable splits from a list of SplitAccess.""" + """Splits whose rows/details are not viewable (non_viewable and no_access). + + no_access is strictly more restrictive than non_viewable, so it is excluded + everywhere non_viewable is. The non_viewable/no_access distinction (summary + + agent-evaluable vs. not) is enforced in the evaluation engine, not here. 
+ """ return [ - sa.split for sa in split_accesses if sa.access == SplitAccessLevel.non_viewable + sa.split + for sa in split_accesses + if sa.access in (SplitAccessLevel.non_viewable, SplitAccessLevel.no_access) ] diff --git a/vero/src/vero/core/db/result.py b/vero/src/vero/core/db/result.py index c22c0df..1a81608 100644 --- a/vero/src/vero/core/db/result.py +++ b/vero/src/vero/core/db/result.py @@ -3,12 +3,11 @@ import json import logging import traceback -from dataclasses import dataclass from enum import Enum from typing import TYPE_CHECKING, Any, Sequence from uuid import uuid4 -from pydantic import BaseModel, Field +from pydantic import BaseModel, ConfigDict, Field, model_validator from vero.core.constants import default_maximum_score, default_minimum_score from vero.core.db.dataset import DatasetSample @@ -20,20 +19,42 @@ logger = logging.getLogger(__name__) -@dataclass -class TaskOutput: - """Non-serializable output of an agent on a single task. Used within a subprocess to collate the outputs of the inference process. +class TaskOutput(BaseModel): + """Serializable output of inference on a single task. + + Persisted between the inference and scoring stages, so it must be + JSON-serializable. An ``Exception`` passed to ``error`` is coerced to its + string form, with the traceback captured into ``error_traceback``. Attributes: output: The output of the agent on the task. - error: An optional error string, e.g. the traceback of the error. - execution_trace: An optional list of spans indicating details of the inference process. + error: An error string (e.g. ``str(exception)``). + error_traceback: Full traceback string if inference raised. + execution_trace: An optional list of spans describing the inference process. 
""" + model_config = ConfigDict(arbitrary_types_allowed=True) + output: Any = None - error: Exception | None = None + error: str | None = None + error_traceback: str | None = None execution_trace: Sequence[Any] | None = None + @model_validator(mode="before") + @classmethod + def _coerce_exception_error(cls, data: Any) -> Any: + """Accept an Exception in ``error`` and convert it to str + traceback.""" + if isinstance(data, dict): + err = data.get("error") + if isinstance(err, BaseException): + data = dict(data) + data["error"] = str(err) + if not data.get("error_traceback"): + data["error_traceback"] = "".join( + traceback.format_exception(type(err), err, err.__traceback__) + ) + return data + class TaskResult(BaseModel): """Serializable evaluation result for a single task. Used across processes for long-term storage of evaluation results. @@ -62,21 +83,12 @@ class TaskResult(BaseModel): @classmethod def from_task_output(cls, task_output: TaskOutput, **kwargs: Any) -> TaskResult: - """Create a TaskResult from a TaskOutput.""" - - if isinstance(task_output.error, Exception): - kwargs["error"] = str(task_output.error) - kwargs["error_traceback"] = "".join( - traceback.format_exception( - type(task_output.error), - task_output.error, - task_output.error.__traceback__, - ) - ) - - kwargs["execution_trace"] = task_output.execution_trace + """Create a TaskResult from a (serializable) TaskOutput.""" kwargs["output"] = task_output.output - + kwargs["execution_trace"] = task_output.execution_trace + if task_output.error is not None: + kwargs.setdefault("error", task_output.error) + kwargs.setdefault("error_traceback", task_output.error_traceback) return cls(**kwargs) @@ -113,6 +125,10 @@ def is_error(self) -> bool: or self.error_traceback is not None ) + def is_scored(self) -> bool: + """True once the scoring stage has run for this sample (score or eval_error set).""" + return self.score is not None or self.eval_error is not None + def as_pandas_series(self, exclude: 
set[str] | None = None) -> Series: """Return the sample result in a pandas representation.""" import pandas as pd diff --git a/vero/src/vero/core/task/__init__.py b/vero/src/vero/core/task/__init__.py index dd9471b..a034e4b 100644 --- a/vero/src/vero/core/task/__init__.py +++ b/vero/src/vero/core/task/__init__.py @@ -6,6 +6,7 @@ def create_task( register: bool = True, task_parameters: type | None = None, required_env_vars: list[str] | None = None, + label_fields: list[str] | None = None, ) -> VeroTask: """Create a VeroTask for use in user code. @@ -15,6 +16,10 @@ def create_task( task_parameters: Optional TaskParameters subclass for early validation. required_env_vars: Environment variables that must be set for this task to run (e.g. ``["LITELLM_BASE_URL", "LITELLM_API_KEY"]``). + label_fields: Dataset columns that hold labels/ground truth. These are + stripped from each sample before it is passed to inference, so the + (agent-authored) inference code never sees them; scoring still gets + the full row. A static, immutable property of the task definition. Returns: A new VeroTask instance. 
@@ -24,6 +29,7 @@ def create_task( register=register, task_parameters_type=task_parameters, required_env_vars=required_env_vars, + label_fields=label_fields, ) diff --git a/vero/src/vero/core/task/task.py b/vero/src/vero/core/task/task.py index c9e605f..6e1f651 100644 --- a/vero/src/vero/core/task/task.py +++ b/vero/src/vero/core/task/task.py @@ -14,7 +14,12 @@ from vero.core.db.dataset import DatasetSample from vero.core.db.result import SampleResult, TaskOutput, TaskResult from vero.core.evaluation import EvaluationParameters -from vero.core.sessions import get_vero_home_dir, save_sample_result +from vero.core.sessions import ( + get_vero_home_dir, + load_all_sample_results, + load_sample_result, + save_sample_result, +) from vero.core.utils import limited_gather, maybe_await logger = logging.getLogger(__name__) @@ -79,6 +84,7 @@ def __init__( register: bool = True, task_parameters_type: type | None = None, required_env_vars: list[str] | None = None, + label_fields: list[str] | None = None, ): """Initialize a VeroTask. @@ -90,12 +96,16 @@ def __init__( required_env_vars: Environment variables that must be set for this task to run (e.g. ``["LITELLM_BASE_URL", "LITELLM_API_KEY"]``). Checked before the evaluation subprocess starts. + label_fields: Dataset columns holding labels/ground truth. Stripped from + each sample before inference (so inference never sees them); scoring + receives the full row. Static, immutable task property. 
""" self.name = name self._functions: dict[str, Callable] = {} self._batch_functions: dict[str, Callable] = {} self._task_parameters_type = task_parameters_type self.required_env_vars: list[str] = required_env_vars or [] + self.label_fields: list[str] = label_fields or [] if register: if name in VeroTask._registry: @@ -502,76 +512,111 @@ async def evaluate_safely(task: TaskT, output: TaskOutput) -> TaskResult: # Results # ------------------------------------------------------------------------- - def compile_and_save_sample_results( - self, - evaluation_parameters: EvaluationParameters, - results: list[TaskResult | Exception], - task_data: Sequence[dict[str, JsonValue]] | None = None, - ) -> dict[str, int | float | None]: - """Compile results into SampleResult objects and save to disk. + # ------------------------------------------------------------------------- + # Per-stage persistence + # + # Each sample is persisted to its own ``samples/{id}.json`` file as it + # completes: a partial SampleResult after inference (score=None), then the + # same file updated with scoring fields. This makes every stage independently + # runnable and resumable from any partial state. + # ------------------------------------------------------------------------- - Args: - evaluation_parameters: Evaluation parameters. - results: List of evaluation results or exceptions. - task_data: Raw task data dicts for each sample (used to populate input field). + def _sessions_dir(self) -> Path: + return get_vero_home_dir() / "sessions" - Returns: - Metrics dictionary. + def _scrub_inputs(self, row: Any) -> Any: + """Strip ``label_fields`` from a sample before it reaches inference. + + Only applies to mapping rows; non-mapping rows pass through unchanged. 
""" + if not self.label_fields: + return row + try: + return {k: v for k, v in dict(row).items() if k not in self.label_fields} + except (TypeError, ValueError): + return row + + def _dataset_sample( + self, params: EvaluationParameters, sample_id: int + ) -> DatasetSample: + return DatasetSample( + sample_id=sample_id, + split=params.run.dataset_subset.split, + dataset_id=params.run.dataset_subset.dataset_id, + ) + + def _save_inference( + self, + params: EvaluationParameters, + sample_id: int, + task_data: Sequence[dict[str, JsonValue]] | None, + pos: int, + output: TaskOutput, + ) -> None: + """Persist a partial SampleResult holding only the inference output.""" + sample_input = ( + self._scrub_inputs(dict(task_data[pos])) + if task_data is not None and pos < len(task_data) + else None + ) + sample_result = SampleResult.from_task_result( + dataset_sample=self._dataset_sample(params, sample_id), + task_result=TaskResult.from_task_output(output), + commit=params.run.candidate.commit, + result_id=params.result_id, + input=sample_input, + ) + save_sample_result( + self._sessions_dir(), + params.session_id, + params.result_id, + sample_id=sample_id, + result=sample_result, + ) + + def _save_score( + self, + params: EvaluationParameters, + sample_result: SampleResult, + task_result: TaskResult, + ) -> None: + """Update a persisted SampleResult with scoring-stage fields and re-save.""" + sample_result.score = task_result.score + sample_result.feedback = task_result.feedback + sample_result.metrics = task_result.metrics + sample_result.eval_error = task_result.eval_error + sample_result.eval_trace = task_result.eval_trace + if task_result.error_traceback and sample_result.error_traceback is None: + sample_result.error_traceback = task_result.error_traceback + save_sample_result( + self._sessions_dir(), + params.session_id, + params.result_id, + sample_id=sample_result.dataset_sample.sample_id, + result=sample_result, + ) + + def compute_metrics( + self, params: 
EvaluationParameters + ) -> dict[str, int | float | None]: + """Compute metrics from the SampleResults persisted on disk.""" from vero.core.constants import default_minimum_score - metrics = { - "num_samples": len(results), + sample_results = load_all_sample_results( + self._sessions_dir(), params.session_id, params.result_id + ) + + metrics: dict[str, int | float | None] = { + "num_samples": len(sample_results), "num_errors": 0, "avg_score": 0, "avg_filled_score": None, } - - sample_results: dict[int, SampleResult] = {} - sample_ids = evaluation_parameters.run.dataset_subset.sample_ids - if sample_ids is None: - sample_ids = list(range(len(results))) - - commit = evaluation_parameters.run.candidate.commit - result_id = evaluation_parameters.result_id - - for idx, (sample_id, result) in enumerate(zip(sample_ids, results)): - dataset_sample = DatasetSample( - sample_id=sample_id, - split=evaluation_parameters.run.dataset_subset.split, - dataset_id=evaluation_parameters.run.dataset_subset.dataset_id, - ) - - sample_input = ( - dict(task_data[idx]) - if task_data is not None and idx < len(task_data) - else None - ) - common_kwargs = { - "commit": commit, - "result_id": result_id, - "input": sample_input, - } - - if isinstance(result, Exception): - error = "".join( - traceback.format_exception( - type(result), result, result.__traceback__ - ) - ) - sample_results[sample_id] = SampleResult( - dataset_sample=dataset_sample, error=error, **common_kwargs - ) - metrics["num_errors"] = metrics["num_errors"] + 1 - else: - sample_results[sample_id] = SampleResult.from_task_result( - dataset_sample=dataset_sample, task_result=result, **common_kwargs - ) - - if result.error is not None or result.eval_error is not None: - metrics["num_errors"] = metrics["num_errors"] + 1 - elif result.score is not None: - metrics["avg_score"] = metrics["avg_score"] + result.score + for sr in sample_results.values(): + if sr.error is not None or sr.eval_error is not None: + metrics["num_errors"] 
+= 1 + elif sr.score is not None: + metrics["avg_score"] += sr.score metrics["num_successes"] = metrics["num_samples"] - metrics["num_errors"] @@ -581,7 +626,6 @@ def compile_and_save_sample_results( metrics["avg_score"] = None metrics["avg_filled_score"] = metrics["avg_score"] - if metrics["avg_score"] is None: metrics["avg_filled_score"] = default_minimum_score elif metrics["num_errors"] > 0: @@ -590,19 +634,6 @@ def compile_and_save_sample_results( + metrics["num_errors"] * default_minimum_score ) / metrics["num_samples"] - if sample_results: - vero_home = get_vero_home_dir() - sessions_dir = vero_home / "sessions" - for sample_id, result in sample_results.items(): - save_sample_result( - sessions_dir, - evaluation_parameters.session_id, - evaluation_parameters.result_id, - sample_id=sample_id, - result=result, - ) - logger.info(f"Saved {len(sample_results)} sample results") - return metrics # ------------------------------------------------------------------------- @@ -639,14 +670,160 @@ def _validate_required_functions(self) -> None: + "\n".join(f" - {e}" for e in errors) ) + async def run_inference_stage(self, params: EvaluationParameters) -> None: + """Run (or resume) inference, persisting each sample as it completes. + + Resume: samples whose ``samples/{id}.json`` already exists are skipped. + Per-sample inference persists incrementally; a batch inference function + persists after the batch returns. 
+ """ + tasks, task_data = self._load_and_prepare_data(params) + sample_ids = params.run.dataset_subset.sample_ids + if sample_ids is None: + sample_ids = list(range(len(tasks))) + + sessions_dir = self._sessions_dir() + pending = [ + (pos, sid) + for pos, sid in enumerate(sample_ids) + if load_sample_result(sessions_dir, params.session_id, params.result_id, sid) + is None + ] + if not pending: + logger.info("Inference stage: all samples already persisted; skipping") + return + + single_fn = self.get("run_inference", batch=False) + batch_fn = self.get("run_inference", batch=True) + if single_fn is None and batch_fn is None: + raise RuntimeError( + "No inference function registered. " + "Use @task.inference() or @task.inference(batch=True) to register one." + ) + + if single_fn is not None: + + async def infer_and_save(pos: int, sid: int) -> TaskOutput: + output = self.cast_to_task_output( + await maybe_await(single_fn(self._scrub_inputs(tasks[pos]), params)) + ) + self._save_inference(params, sid, task_data, pos, output) + return output + + results = await limited_gather( + coro_factories=[ + (lambda p=pos, s=sid: infer_and_save(p, s)) for pos, sid in pending + ], + limit=params.max_concurrency, + retry_config=params.retry_config, + desc="Running inference", + return_exceptions=True, + timeout=params.sample_timeout, + run_in_thread=params.use_threading, + ) + # Persist an error record for samples that exhausted retries. 
+ for (pos, sid), res in zip(pending, results): + if isinstance(res, Exception): + self._save_inference( + params, sid, task_data, pos, TaskOutput(error=res) + ) + else: + outputs = await self.run_batch_inference( + [self._scrub_inputs(tasks[pos]) for pos, _ in pending], params + ) + for (pos, sid), output in zip(pending, outputs): + self._save_inference(params, sid, task_data, pos, output) + + logger.info(f"Inference stage complete: {len(pending)} samples") + + async def run_scoring_stage(self, params: EvaluationParameters) -> None: + """Run (or resume) scoring over persisted inference outputs. + + Skips samples that errored during inference (terminal) or are already + scored. Reads inference outputs from disk and re-persists with scores. + """ + tasks, _ = self._load_and_prepare_data(params) + sample_ids = params.run.dataset_subset.sample_ids + if sample_ids is None: + sample_ids = list(range(len(tasks))) + + existing = load_all_sample_results( + self._sessions_dir(), params.session_id, params.result_id + ) + pending: list[tuple[int, SampleResult]] = [] + for pos, sid in enumerate(sample_ids): + sr = existing.get(sid) + if sr is None: + logger.warning( + f"Scoring stage: no inference result for sample {sid}; skipping" + ) + continue + if sr.error is not None: # inference error is terminal + continue + if sr.is_scored(): + continue + pending.append((pos, sr)) + if not pending: + logger.info("Scoring stage: nothing to score; skipping") + return + + single_fn = self.get("run_evaluation", batch=False) + batch_fn = self.get("run_evaluation", batch=True) + if single_fn is None and batch_fn is None: + raise RuntimeError( + "No evaluation function registered. " + "Use @task.evaluation() or @task.evaluation(batch=True) to register one." 
+ ) + + def _output(sr: SampleResult) -> TaskOutput: + return TaskOutput( + output=sr.output, error=sr.error, execution_trace=sr.execution_trace + ) + + if single_fn is not None: + + async def score_and_save(pos: int, sr: SampleResult) -> None: + result = await maybe_await(single_fn(tasks[pos], _output(sr), params)) + self._save_score(params, sr, self.cast_to_task_result(_output(sr), result)) + + results = await limited_gather( + coro_factories=[ + (lambda p=pos, r=sr: score_and_save(p, r)) for pos, sr in pending + ], + limit=params.max_concurrency, + retry_config=params.retry_config, + desc="Evaluating samples", + return_exceptions=True, + timeout=params.sample_timeout, + run_in_thread=params.use_threading, + ) + for (pos, sr), res in zip(pending, results): + if isinstance(res, Exception): + self._save_score( + params, sr, self.cast_to_task_result(_output(sr), res) + ) + else: + eval_results = await self.run_batch_evaluation( + [tasks[pos] for pos, _ in pending], + [_output(sr) for _, sr in pending], + params, + ) + for (pos, sr), task_result in zip(pending, eval_results): + self._save_score(params, sr, task_result) + + logger.info(f"Scoring stage complete: {len(pending)} samples") + async def run(self, params: EvaluationParameters) -> dict[str, Any]: - """Run the complete evaluation pipeline. + """Run the full evaluation pipeline as two resumable stages. + + Inference and scoring each persist per-sample as they complete and skip + already-done samples, so a crashed run resumes from its partial state. Args: params: Evaluation parameters. Returns: - Metrics dictionary. + Metrics dictionary (computed from the persisted sample results). Raises: RuntimeError: If required functions are not registered. 
@@ -660,21 +837,11 @@ async def run(self, params: EvaluationParameters) -> dict[str, Any]: # Validate required functions are registered self._validate_required_functions() - # Step 1: Load and prepare data - tasks, task_data = self._load_and_prepare_data(params) - logger.info(f"Loaded {len(tasks)} samples") - - # Step 2: Run inference - outputs = await self.run_batch_inference(tasks, params) + await self.run_inference_stage(params) + await self.run_scoring_stage(params) - # Step 3: Run evaluation - results = await self.run_batch_evaluation(tasks, outputs, params) - logger.info(f"Processed {len(results)} samples") - - # Step 4: Compile and save results - metrics = self.compile_and_save_sample_results(params, results, task_data) + metrics = self.compute_metrics(params) logger.info(f"Logged results: {metrics}") - return metrics def __repr__(self) -> str: diff --git a/vero/src/vero/evaluation/__init__.py b/vero/src/vero/evaluation/__init__.py new file mode 100644 index 0000000..816e191 --- /dev/null +++ b/vero/src/vero/evaluation/__init__.py @@ -0,0 +1,19 @@ +"""Evaluation: the Evaluator (checkout + run) and the EvaluationEngine that +orchestrates it (sample resolution + budget metering). The in-process +ExperimentRunnerTool and the Harbor eval sidecar are both frontends over the engine. +""" + +from vero.evaluation.engine import EvalRequest, EvaluationEngine +from vero.evaluation.evaluator import ( + Evaluator, + isolate_project, + run_evaluation, +) + +__all__ = [ + "Evaluator", + "isolate_project", + "run_evaluation", + "EvaluationEngine", + "EvalRequest", +] diff --git a/vero/src/vero/evaluation/engine.py b/vero/src/vero/evaluation/engine.py new file mode 100644 index 0000000..5524c57 --- /dev/null +++ b/vero/src/vero/evaluation/engine.py @@ -0,0 +1,188 @@ +"""EvaluationEngine: the shared evaluation core. + +Wraps the :class:`~vero.evaluation.evaluator.Evaluator` with budget metering and the +dataset/split allowlist.
It is the single eval path used by both the in-process +``ExperimentRunnerTool`` (in-memory budget) and the Harbor eval sidecar (durable +budget + HTTP frontend). It returns the **full** ``Experiment`` — redaction, +write-routing, and human/wire formatting are the frontend's job, not the core's. +""" + +from __future__ import annotations + +import logging +from dataclasses import dataclass +from pathlib import Path +from typing import TYPE_CHECKING + +from vero.core.budget import BudgetLedger, SplitBudget +from vero.core.evaluation import BaseEvaluationParameters + +if TYPE_CHECKING: + from vero.core.db.database import Experiment, ExperimentDatabase + from vero.evaluation.evaluator import Evaluator + +logger = logging.getLogger(__name__) + + +@dataclass +class EvalRequest: + """A request to evaluate a commit on a dataset split. + + Also the agent-facing wire payload in the Harbor case. ``task`` is + not a field — it is fixed config bound on the service, not agent-chosen. + """ + + dataset_id: str + split: str + commit: str | None = None # None -> resolved by the caller (e.g. 
agent repo HEAD) + sample_ids: list[int] | None = None + num_samples: int | None = None + + +class EvaluationEngine: + """Resolve samples -> meter budget -> run the Evaluator -> full Experiment.""" + + def __init__( + self, + *, + evaluator: Evaluator, + budget: BudgetLedger, + default_task: str | None = None, + db: ExperimentDatabase | None = None, + run_constraints: BaseEvaluationParameters | None = None, + session_id: str | None = None, + vero_home: Path | None = None, + ): + self.evaluator = evaluator + self.budget = budget + self.default_task = default_task + self.db = db + self.run_constraints = run_constraints or BaseEvaluationParameters() + self.session_id = session_id + self.vero_home = vero_home + + @classmethod + def from_session(cls, session) -> EvaluationEngine: + """Build a service from a bound Session (mirrors ExperimentRunnerTool.bind).""" + from copy import deepcopy + + return cls( + evaluator=session.evaluator, + budget=BudgetLedger(deepcopy(session.budget)), + default_task=session.task, + db=session.db, + run_constraints=session.evaluation_parameters, + session_id=session.session_id, + vero_home=session.vero_home, + ) + + # ------------------------------------------------------------------ + # Dataset / sample resolution (lifted from ExperimentRunnerTool) + # ------------------------------------------------------------------ + + def _get_dataset_info(self, dataset_id: str): + from vero.core.dataset import DatasetInfo + from vero.core.dataset.store import load_dataset + + sessions_dir = self.vero_home / "sessions" if self.vero_home else None + dataset_cache = self.vero_home / "datasets" if self.vero_home else None + dataset = load_dataset(sessions_dir, dataset_cache, self.session_id, dataset_id) + return DatasetInfo( + id=dataset_id, + splits={split: len(dataset[split]) for split in dataset}, + features={split: list(dataset[split].features) for split in dataset}, + ) + + def _get_samples_from_split( + self, dataset_id: str, split: str, 
num_samples: int + ) -> list[int] | None: + """First-N sample ids, or None when N covers (or exceeds) the whole split.""" + split_size = self._get_dataset_info(dataset_id).splits[split] + num_samples = min(num_samples, split_size) + if num_samples >= split_size: + return None + return list(range(num_samples)) + + def _validate_and_count_samples( + self, dataset_id: str, split: str, sample_ids: list[int] | None = None + ) -> int: + """Validate sample ids are in range; return the count (full split if None).""" + split_size = self._get_dataset_info(dataset_id).splits[split] + if sample_ids is None: + return split_size + invalid = [s for s in sample_ids if s < 0 or s >= split_size] + if invalid: + raise ValueError( + f"The provided sample ids are outside the range of the split " + f"[0, {split_size - 1}]: {invalid}" + ) + return len(sample_ids) + + def resolve_samples(self, req: EvalRequest) -> tuple[list[int] | None, int]: + """Resolve (sample_ids, count) for a request. Raises on invalid combos.""" + if req.sample_ids is not None and req.num_samples is not None: + raise ValueError( + "Cannot specify both sample_ids and num_samples. " + "Use sample_ids for specific samples, or num_samples for the first N samples." + ) + sample_ids = req.sample_ids + if req.num_samples is not None: + sample_ids = self._get_samples_from_split( + req.dataset_id, req.split, req.num_samples + ) + count = self._validate_and_count_samples(req.dataset_id, req.split, sample_ids) + return sample_ids, count + + # ------------------------------------------------------------------ + # Evaluation + # ------------------------------------------------------------------ + + async def evaluate(self, req: EvalRequest, *, admin: bool = False) -> Experiment: + """Meter (unless admin) and run one evaluation; return the full Experiment. 
+ + ``no_access`` gating is implicit: those splits are absent from the budget + ledger, so ``reserve`` raises ``InvalidSplitError`` for the agent; admin + bypasses the ledger and may evaluate anything. + """ + sample_ids, n = self.resolve_samples(req) + if not admin: + await self.budget.reserve(req.dataset_id, req.split, n) + return await self.evaluator.evaluate( + commit=req.commit, + dataset_id=req.dataset_id, + split=req.split, + task=self.default_task, + sample_ids=sample_ids, + db=self.db, + evaluation_parameters=self.run_constraints, + ) + + async def evaluate_admin( + self, + *, + task: str, + dataset_id: str, + split: str, + commit: str, + sample_ids: list[int] | None = None, + ) -> Experiment: + """Admin/verifier evaluation: explicit ``task``, no budget, no allowlist. + + Unlike :meth:`evaluate` (which is bound to ``default_task`` and metered), + this scores an arbitrary ``(task, dataset_id, split)`` — including held-out + tasks/splits the agent never had access to. Used by the verifier to score + the selected commit on its configured targets. 
+ """ + return await self.evaluator.evaluate( + commit=commit, + dataset_id=dataset_id, + split=split, + task=task, + sample_ids=sample_ids, + db=self.db, + evaluation_parameters=self.run_constraints, + ) + + def status(self) -> dict[tuple[str, str], SplitBudget]: + """Remaining budget per (split, dataset_id).""" + return self.budget.status() diff --git a/vero/src/vero/evaluation/evaluator.py b/vero/src/vero/evaluation/evaluator.py new file mode 100644 index 0000000..c13f7c0 --- /dev/null +++ b/vero/src/vero/evaluation/evaluator.py @@ -0,0 +1,836 @@ +from __future__ import annotations + +import json +import logging +import os +import random +import traceback +from pathlib import Path + +import yaml +from rich.panel import Panel +from rich.syntax import Syntax + +from vero.core.cli_adapters import UvRunParameters +from vero.core.constants import ( + evaluation_parameters_basename, + evaluation_results_basename, + pytest_report_basename, + result_metadata_basename, + samples_dir_name, +) +from vero.core.db.candidate import Candidate +from vero.core.db.database import Experiment, ExperimentDatabase +from vero.core.db.dataset import DatasetSubset +from vero.core.db.result import ExperimentResult, SampleResult +from vero.core.db.run import ExperimentRun +from vero.core.evaluation import BaseEvaluationParameters, EvaluationParameters +from vero.core.sessions import ( + clear_result_cache, + get_experiment_dir, + get_session_dir, + get_vero_home_dir, + initialize_result_store, + load_all_sample_results, + save_json_to_cache, +) +from vero.core.task.utils import get_discover_cmd, get_run_cmd +from vero.evaluation.strategy import EvalStrategy +from vero.exceptions import ExperimentRunFailedError +from vero.logging import setup_console +from vero.utils import run_subprocess_with_tee +from vero.workspace import Workspace +from vero.workspace.git import GitWorkspace + +console = setup_console() + +logger = logging.getLogger(__name__) + + +class Evaluator: + """Evaluates 
experiment runs by checking out commits and running tasks in subprocesses.""" + + def __init__( + self, + workspace: Workspace, + session_id: str, + *, + vero_home: Path | None = None, + use_copy: bool = False, + hooks: list[str] | None = None, + sync: bool = False, + subprocess_env_vars: list | Path | str | None = None, + task_project: Path | None = None, + task_module: str | None = None, + eval_strategy: EvalStrategy | None = None, + ): + self.workspace = workspace + self.session_id = session_id + self.vero_home = vero_home or get_vero_home_dir() + self.use_copy = use_copy + self.hooks = hooks if hooks is not None else ["setup_logging"] + self.sync = sync + self._subprocess_env_vars = subprocess_env_vars + self.task_project = task_project + self.task_module = task_module + # Mode-specific "produce sample results" step. None = Mode A (task.utils, + # inline below). A strategy (e.g. Harbor Mode B) is injected by the caller. + self.eval_strategy = eval_strategy + self.on_experiment: list = [] # Callbacks fired after each evaluate() + + @property + def sessions_dir(self) -> Path: + return self.vero_home / "sessions" + + @property + def dataset_cache(self) -> Path: + return self.vero_home / "datasets" + + @property + def subprocess_env(self) -> dict[str, str] | None: + """Build subprocess env on demand from var names. 
Returns None to inherit os.environ.""" + if self._subprocess_env_vars is None: + return None + from vero.utils.subprocess_env import build_subprocess_env + + return build_subprocess_env(self._subprocess_env_vars) + + def _get_subprocess_env_with_vero_home(self) -> dict[str, str] | None: + """Build subprocess env and ensure VERO_HOME_DIR is set.""" + env = self.subprocess_env + if env is not None: + env["VERO_HOME_DIR"] = str(self.vero_home) + return env + + @staticmethod + def log_evaluation_results(result: ExperimentResult) -> None: + """Logs the evaluation results to the console.""" + stats = ( + result.sample_results_statistics( + as_dict=True, convert_lists_to_strings=True + ) + or {} + ) + if len(stats) > 0: + syntax = Syntax( + yaml.dump(stats, sort_keys=False), + "yaml", + theme="monokai", + line_numbers=False, + ) + console.print( + Panel( + syntax, + title="[bold green]⚙️ Evaluation Statistics[/bold green]", + border_style="green", + ) + ) + else: + console.print(f"No ExperimentResult found for run {result.run_id}.") + + def load_sample_results_from_cache( + self, evaluation_parameters: EvaluationParameters + ) -> dict[int, SampleResult]: + """Load the sample results from the cache. + + Tries to load from per-sample files first (new format), then falls back + to the single JSON file (legacy format) for backward compatibility. + """ + sample_results = load_all_sample_results( + self.sessions_dir, self.session_id, evaluation_parameters.result_id + ) + + if not sample_results: + logger.warning( + f"No sample results found for run {evaluation_parameters.run.id}." + ) + + return sample_results + + def _get_uv_params( + self, agent_project_path: Path | str + ) -> tuple[UvRunParameters, Path | str]: + """Build UvRunParameters and determine cwd for subprocess. + + When task_project is set, runs uv in the task project and layers + the agent code on top via --with-editable. Otherwise runs in the + agent project directly (backward compat). 
+ + Returns: + (uv_params, cwd) tuple. + """ + if self.task_project: + return ( + UvRunParameters.from_env( + project=str(self.task_project), + with_editable=str(agent_project_path), + ), + self.task_project, + ) + return UvRunParameters.from_env( + project=str(agent_project_path) + ), agent_project_path + + async def _discover_tasks(self, project_path: Path | str) -> dict: + """Discover tasks via isolated subprocess. + + Args: + project_path: Path to the agent project. + + Returns: + Dictionary with package name and task metadata. + """ + uv_params, cwd = self._get_uv_params(project_path) + cmd = [*uv_params.get_cmd(), *get_discover_cmd(task_module=self.task_module)] + result = await run_subprocess_with_tee( + cmd, + timeout=60, + cwd=str(cwd), + flush=False, + tee_stdout=False, + env=self._get_subprocess_env_with_vero_home(), + ) + + if result.returncode != 0: + raise ExperimentRunFailedError( + f"Task discovery failed. Error: {result.stderr}.", + stdout=result.stdout, + stderr=result.stderr, + returncode=int(result.returncode), + ) + + return json.loads(result.stdout) + + async def _run_task( + self, + project_path: Path | str, + task_name: str, + params_file: Path, + timeout: int = 60 * 10, + ) -> dict | None: + """Execute task via isolated subprocess. + + Args: + project_path: Path to the user's project. + task_name: Name of the task to execute. + params_file: Path to JSON file containing EvaluationParameters. + timeout: Subprocess timeout in seconds. + + Returns: + Metrics dictionary from task execution, or None if parsing fails. 
+ """ + uv_params, cwd = self._get_uv_params(project_path) + cmd = [ + *uv_params.get_cmd(), + *get_run_cmd( + task_name, params_file, hooks=self.hooks, task_module=self.task_module + ), + ] + result = await run_subprocess_with_tee( + cmd, + timeout=timeout, + cwd=cwd, + flush=True, + env=self._get_subprocess_env_with_vero_home(), + ) + logger.info("Subprocess complete!") + + # Save subprocess output for debugging + log_dir = params_file.parent + if result.stderr: + (log_dir / "subprocess_stderr.log").write_text(result.stderr) + if result.stdout: + (log_dir / "subprocess_stdout.log").write_text(result.stdout) + if result.returncode != 0: + (log_dir / "subprocess_returncode.txt").write_text(str(result.returncode)) + logger.warning( + f"Subprocess exited with code {result.returncode}. " + f"Stderr: {result.stderr[:500] if result.stderr else '(empty)'}" + ) + + # Read metrics from file (written by task subprocess) + metrics_path = log_dir / "metrics.json" + if metrics_path.exists(): + try: + return json.loads(metrics_path.read_text()) + except json.JSONDecodeError: + logger.warning(f"Failed to parse {metrics_path} as JSON") + return None + else: + logger.warning(f"Metrics file not found at {metrics_path}") + return None + + async def _run_task_in_subprocess( + self, + params: EvaluationParameters, + workspace: Workspace, + ) -> None: + """Run task via vero.task_utils subprocess. + + Args: + params: Evaluation parameters (must have task set). + workspace: Workspace to run in. + + Raises: + ExperimentRunFailedError: If task discovery or execution fails. + """ + + # Discover available tasks first + try: + discovery_result = await self._discover_tasks(workspace.project_path) + except Exception as e: + error_str = "".join(traceback.format_exception(type(e), e, e.__traceback__)) + raise ExperimentRunFailedError( + f"Task discovery failed. 
Error: {error_str}.", + stdout="", + stderr=error_str, + returncode=1, + ) + + # Validate the requested task exists + available_tasks = discovery_result.get("tasks", {}) + if params.task not in available_tasks: + available_names = list(available_tasks.keys()) + raise ExperimentRunFailedError( + f"Task '{params.task}' not found in package '{discovery_result.get('package', 'unknown')}'.\n" + f"Available tasks: {available_names if available_names else '(none found)'}\n" + f"Ensure your task is registered in vero_tasks/__init__.py", + stdout="", + stderr="", + returncode=1, + ) + + # Validate required environment variables + required_env = available_tasks[params.task].get("required_env_vars", []) + if required_env: + missing = [v for v in required_env if not os.environ.get(v)] + if missing: + raise ExperimentRunFailedError( + f"Task '{params.task}' requires environment variables that are not set: " + f"{', '.join(missing)}. Set them before running.", + stdout="", + stderr="", + returncode=1, + ) + + # Run the task + result_dir = get_experiment_dir( + self.sessions_dir, self.session_id, params.result_id + ) + params_file = result_dir / evaluation_parameters_basename + logger.info( + f"Running task '{params.task}' via vero.task_utils in {workspace.project_path}" + ) + try: + metrics = await self._run_task( + workspace.project_path, + params.task, + params_file, + timeout=params.timeout, + ) + logger.info(f"Task completed with metrics: {metrics}") + except Exception as e: + error_str = "".join(traceback.format_exception(type(e), e, e.__traceback__)) + raise ExperimentRunFailedError( + f"Task execution failed. Error: {error_str}.", + stdout="", + stderr=error_str, + returncode=1, + ) + + async def run( + self, + evaluation_parameters: EvaluationParameters, + use_copy: bool | None = None, + ) -> ExperimentResult: + """Run an experiment by checking out the candidate commit and running tasks via uv. + + Args: + evaluation_parameters: The parameters for the evaluation. 
+ use_copy: Override for self.use_copy. If True, creates a temporary isolated copy + of the workspace (always clean). If False, uses the current workspace (requires clean state). + + Returns: + ExperimentResult with sample results and metadata. + """ + use_copy = use_copy if use_copy is not None else self.use_copy + + if not use_copy: + return await self._run_in_workspace(evaluation_parameters, self.workspace) + + async with self.workspace.temp_copy( + from_version=evaluation_parameters.run.candidate.commit, + ) as temp_workspace: + return await self._run_in_workspace(evaluation_parameters, temp_workspace) + + async def evaluate( + self, + commit: str, + dataset_id: str, + split: str, + task: str | None = None, + sample_ids: list[int] | None = None, + db: ExperimentDatabase | None = None, + evaluation_parameters: BaseEvaluationParameters | None = None, + use_copy: bool | None = None, + ) -> Experiment: + """Full evaluation lifecycle: resolve commit → run → create experiment → DB → hooks. + + This is the single entry point for all evaluations. Both Policy.evaluate_commit() + and ExperimentRunnerTool delegate here. + + Args: + commit: Git commit hash or ref to evaluate. + dataset_id: Dataset ID in the session store. + split: Dataset split to evaluate. + task: Task name to execute. + sample_ids: Specific sample IDs to evaluate (None = all). + db: ExperimentDatabase to record the experiment in. + evaluation_parameters: Base eval params (timeout, concurrency, etc.). + use_copy: Whether to create a temporary copy for the eval. + + Returns: + The completed Experiment with results. + """ + from vero.core.db.database import Experiment + + # Resolve commit ref to canonical version ID + try: + if isinstance(self.workspace, GitWorkspace): + full_hash = await self.workspace.resolve_ref(commit) + else: + full_hash = commit + except Exception as e: + raise ValueError( + f"Cannot resolve commit '{commit}': {e}. " + f"Make sure the commit exists in the repository." 
+ ) + + # Build candidate + candidate = None + if db is not None: + candidate = db.get_candidate((self.workspace.name, full_hash)) + if candidate is None: + candidate = Candidate(commit=full_hash, repo_name=self.workspace.name) + + # Build run + dataset_subset = DatasetSubset( + split=split, sample_ids=sample_ids, dataset_id=dataset_id + ) + run = ExperimentRun(candidate=candidate, dataset_subset=dataset_subset) + + # Build eval params + base_params = evaluation_parameters or BaseEvaluationParameters() + params = EvaluationParameters( + **base_params.model_dump(), + run=run, + dataset_id=dataset_id, + task=task, + session_id=self.session_id, + ) + + # Run + result = await self.run(params, use_copy=use_copy) + + # Create experiment + experiment = Experiment(run=run, result=result) + + # Add to DB + if db is not None: + db.add_experiment(experiment) + + # Fire post-eval callbacks (may be sync or async) + import asyncio as _asyncio + + for callback in self.on_experiment: + try: + result = callback(experiment) + if _asyncio.iscoroutine(result): + await result + except Exception as e: + logger.warning(f"on_experiment callback failed: {e}") + + return experiment + + async def _run_in_workspace( + self, params: EvaluationParameters, workspace: Workspace + ) -> ExperimentResult: + """Run an experiment by checking out the candidate commit and running tasks via uv.""" + + # We cannot execute with a dirty workspace, as this may introduce side effects on the evaluation results. + if await workspace.is_dirty(): + raise RuntimeError( + "Evaluator cannot execute. There are unsaved changes in the workspace." 
+ ) + + # Update the evaluation parameters with the dataset loader and session_id + params.session_id = self.session_id + + # Initialize the directory to store the evaluation and pytest report files + result_dir = initialize_result_store( + self.sessions_dir, self.session_id, params.result_id + ) + + save_json_to_cache( + self.sessions_dir, + self.session_id, + params.result_id, + basename=evaluation_parameters_basename, + data=params, + ) + logger.info( + f"Saved evaluation parameters to cache: {result_dir / evaluation_parameters_basename}" + ) + + # Git-specific: fetch from remote if configured + if self.sync and isinstance(workspace, GitWorkspace): + await workspace.maybe_fetch() + + # Clear any stale cached results before running to avoid reading old data if run fails + clear_result_cache( + self.sessions_dir, + self.session_id, + params.result_id, + result_basenames=[pytest_report_basename, evaluation_results_basename], + ) + + if self.eval_strategy is not None: + # Non-default strategy (e.g. Harbor Mode B): it owns staging + execution + + # collation, persisting SampleResults to the result store. + async with workspace.at(params.run.candidate.commit): + await self.eval_strategy.produce_sample_results( + workspace=workspace, params=params, result_dir=result_dir + ) + else: + # Mode A: ship data into the sandbox, run task.utils, copy results back. 
+ experiment_dir = str( + get_experiment_dir(self.sessions_dir, self.session_id, params.result_id) + ) + await workspace.sandbox.upload(experiment_dir, experiment_dir) + + # Upload dataset cache so subprocess can load it + from vero.core.dataset.store import _read_mapping + + mapping = _read_mapping(self.sessions_dir, self.session_id) + dataset_fp = mapping.get(params.dataset_id or "") + if dataset_fp: + cache_entry = str(self.dataset_cache / dataset_fp) + await workspace.sandbox.upload(cache_entry, cache_entry) + # Also upload the session datasets.json mapping + session_dir = str(get_session_dir(self.sessions_dir, self.session_id)) + datasets_json = f"{session_dir}/datasets.json" + await workspace.sandbox.upload(datasets_json, datasets_json) + + # Switch to the candidate version and run the evaluation in a subprocess + async with workspace.at(params.run.candidate.commit): + await self._run_task_in_subprocess(params, workspace) + + # Transfer results back from the sandbox + await workspace.sandbox.download(experiment_dir, experiment_dir) + + sample_results = self.load_sample_results_from_cache(params) + + if not sample_results: + raise ExperimentRunFailedError( + f"No sample results found for run {params.run.id}! Likely because execution failed.", + returncode=1, + ) + else: + result = ExperimentResult.create_with_status( + id=params.result_id, + error_rate=params.error_rate_threshold, + run_id=params.run.id, + sample_results=sample_results, + ) + + # Write result metadata to disk so the DB can be reconstructed from experiments/ + save_json_to_cache( + self.sessions_dir, + self.session_id, + params.result_id, + basename=result_metadata_basename, + data={ + "id": result.id, + "run_id": result.run_id, + "status": result.status.value, + }, + ) + + self.log_evaluation_results(result) + return result + + +def _resolve_vero_dependency(isolated_dir: Path, original_project_dir: Path) -> None: + """Resolve the vero path dependency in pyproject.toml after isolation. 
+ + When a project is isolated (copied to a new location), relative path + dependencies in ``[tool.uv.sources]`` break. This function resolves + the ``scale-vero`` dependency to an absolute path via ``uv add``. + + Raises ValueError if any *other* relative path dependencies are found, + since those are unsupported and would silently break. + """ + import subprocess + import tomllib + + pyproject_path = isolated_dir / "pyproject.toml" + if not pyproject_path.exists(): + return + + with open(pyproject_path, "rb") as f: + pyproject = tomllib.load(f) + + sources = pyproject.get("tool", {}).get("uv", {}).get("sources", {}) + if not sources: + return + + for name, source in sources.items(): + if not isinstance(source, dict) or "path" not in source: + continue + + rel_path = source["path"] + if not rel_path.startswith(".") and not rel_path.startswith("/"): + continue # Not a relative path + + if "vero" in name.lower(): + # Always resolve to the known vero package directory rather than + # trusting the relative path (which may be stale or wrong). + from vero.core.constants import PACKAGE_DIR + + abs_path = PACKAGE_DIR + editable_flag = ["--editable"] if source.get("editable") else [] + subprocess.run( + ["uv", "add", *editable_flag, "--dev", str(abs_path)], + cwd=isolated_dir, + capture_output=True, + check=True, + ) + logger.info(f"Resolved {name} dependency: {rel_path} -> {abs_path}") + else: + raise ValueError( + f"Unsupported relative path dependency '{name}' " + f"(path={rel_path!r}) in {pyproject_path}. " + f"Only vero is handled during isolation." + ) + + +def isolate_project( + project_path: Path | str, + session_id: str, + git_ref: str = "HEAD", + *, + sessions_dir: Path, +) -> Path: + """Copy a project into a fresh, standalone git repo. + + Useful when the project lives inside a monorepo or has uncommitted changes. 
+ Extracts files at *git_ref* via ``git archive`` (falling back to a plain + copy when the source is not a git repo), then ``git init`` + ``git commit`` + so the result is a clean, self-contained repository. + + Relative path dependencies on vero in pyproject.toml are resolved to + absolute paths so they remain valid after the copy. + + Args: + project_path: Path to the project directory. + session_id: Session ID (isolated copy is placed under the session dir). + git_ref: Git ref to archive from (default: HEAD). + sessions_dir: Path to the sessions root directory. + + Returns: + Path to the isolated project root. + """ + import shutil + import subprocess + + project_path = Path(project_path).resolve() + isolated_dir = (sessions_dir / session_id) / project_path.name + isolated_dir.mkdir(parents=True, exist_ok=True) + + repo_root_result = subprocess.run( + ["git", "rev-parse", "--show-toplevel"], + cwd=project_path, + capture_output=True, + text=True, + ) + + if repo_root_result.returncode == 0: + repo_root_path = Path(repo_root_result.stdout.strip()) + project_rel = project_path.relative_to(repo_root_path) + strip = len(project_rel.parts) + + archive = subprocess.Popen( + ["git", "archive", git_ref, str(project_rel)], + cwd=repo_root_path, + stdout=subprocess.PIPE, + ) + subprocess.run( + ["tar", "xf", "-", "--strip-components", str(strip)], + cwd=isolated_dir, + stdin=archive.stdout, + check=True, + ) + archive.wait() + else: + shutil.copytree(project_path, isolated_dir, dirs_exist_ok=True) + + # Resolve vero dependency before git init (so it's in the initial commit) + _resolve_vero_dependency(isolated_dir, project_path) + + subprocess.run(["git", "init"], cwd=isolated_dir, capture_output=True, check=True) + subprocess.run( + ["git", "add", "."], cwd=isolated_dir, capture_output=True, check=True + ) + subprocess.run( + [ + "git", + "-c", + "user.name=vero", + "-c", + "user.email=vero@localhost", + "commit", + "-m", + "Initial commit (isolated)", + ], + 
cwd=isolated_dir, + capture_output=True, + check=True, + ) + + if repo_root_result.returncode == 0: + subprocess.run( + ["git", "remote", "add", "origin", repo_root_result.stdout.strip()], + cwd=isolated_dir, + capture_output=True, + ) + + logger.info(f"Isolated project: {project_path} -> {isolated_dir}") + return isolated_dir + + +async def run_evaluation( + project_path: Path | str, + dataset: str | Path, + split: str, + task: str | None = None, + commit: str | None = None, + sample_ids: list[int] | None = None, + num_samples: int | None = None, + task_params: dict | None = None, + seed: int = 42, + timeout: int = 3600, + per_sample_timeout: int = 180, + create_temporary_copy: bool = False, + isolate: bool = False, + hooks: list[str] | None = None, + session_id: str | None = None, + max_concurrency: int | None = None, + subprocess_env_vars: list[str] | Path | str | None = None, + task_project: Path | str | None = None, + task_module: str | None = None, + vero_home: Path | None = None, +) -> ExperimentResult: + """Run an evaluation using the given parameters. + + Args: + project_path: Path to the agent project to evaluate. + dataset: Dataset, DatasetDict, path to saved dataset, or dataset ID string. + split: Dataset split to evaluate. + task: Task name to execute from vero_tasks module. + commit: Commit to evaluate. + sample_ids: List of sample IDs to evaluate. + num_samples: Number of samples to evaluate. + task_params: Task-specific parameters for the evaluation. + seed: Random seed for sample selection. + timeout: Overall timeout for the evaluation subprocess in seconds. + per_sample_timeout: Timeout for a single sample in seconds. + create_temporary_copy: Whether to create a temporary copy for the evaluation. + isolate: Whether to copy the project into a fresh git repo before evaluating. + hooks: List of hook names to execute before task. + session_id: Session ID. + max_concurrency: Maximum concurrent tasks. 
+ subprocess_env_vars: Environment variable names to pass to task subprocesses. + task_project: Path to a separate task project. When set, evaluator runs + uv in the task project and layers the agent via --with-editable. + task_module: Explicit Python module to import for task registration + (e.g. "my_eval_tasks.vero_tasks"). If None, auto-discovers. + vero_home: Path to the vero home directory. Defaults to ~/.vero. + + Returns: + The experiment result. + + Raises: + ExperimentRunFailedError: If the evaluation fails. + """ + from vero.core.dataset.store import resolve_and_save_dataset + + vh = vero_home or get_vero_home_dir() + sessions_dir = vh / "sessions" + dataset_cache = vh / "datasets" + + if task_params is None: + task_params = {} + + if session_id is None: + from uuid import uuid4 + + session_id = str(uuid4()) + logger.info(f"Auto-generated session ID: {session_id}") + + if isolate: + project_path = isolate_project( + project_path, session_id, sessions_dir=sessions_dir + ) + + workspace = await GitWorkspace.create(project_path) + + # Resolve and save dataset + dataset_id = resolve_and_save_dataset( + dataset, sessions_dir, dataset_cache, session_id + ) + + evaluator = Evaluator( + workspace=workspace, + use_copy=create_temporary_copy, + hooks=hooks, + session_id=session_id, + vero_home=vh, + subprocess_env_vars=subprocess_env_vars, + task_project=Path(task_project) if task_project else None, + task_module=task_module, + ) + + if commit is None: + commit = await workspace.current_version() + logger.warning(f"No commit provided, using current commit: {commit}.") + + # Sample data if num_samples is provided + if num_samples is not None and sample_ids is None: + from vero.core.dataset.store import load_dataset as _load_ds + + ds = _load_ds(sessions_dir, dataset_cache, session_id, dataset_id) + rng = random.Random(seed) + sample_ids = rng.sample(range(len(ds[split])), num_samples) + + # Build base eval params + eval_params = BaseEvaluationParameters( + 
timeout=timeout, + sample_timeout=per_sample_timeout, + task_params=task_params, + ) + if max_concurrency is not None: + eval_params.max_concurrency = max_concurrency + + experiment = await evaluator.evaluate( + commit=commit, + dataset_id=dataset_id, + split=split, + task=task, + sample_ids=sample_ids, + evaluation_parameters=eval_params, + use_copy=create_temporary_copy, + ) + + result_dir = get_experiment_dir(sessions_dir, session_id, experiment.id) + console.print(f"Result available at {result_dir / samples_dir_name}") + return experiment.result diff --git a/vero/src/vero/evaluation/strategy.py b/vero/src/vero/evaluation/strategy.py new file mode 100644 index 0000000..3eb0c24 --- /dev/null +++ b/vero/src/vero/evaluation/strategy.py @@ -0,0 +1,35 @@ +"""The evaluation strategy seam. + +The Evaluator handles the shared lifecycle (clean-tree check, result store, checkout, +ExperimentResult assembly) and delegates the mode-specific step ("produce per-sample +results for this candidate/split/sample_ids") to an EvalStrategy. + +The default (Mode A) path is the built-in ``task.utils`` subprocess run, kept inline in +the Evaluator. A non-default strategy (e.g. Harbor Mode B, injected from ``vero.harbor``) +implements this Protocol; the Evaluator never imports the strategy's module, keeping +``vero.evaluation`` harbor-agnostic. 
+""" + +from __future__ import annotations + +from pathlib import Path +from typing import TYPE_CHECKING, Protocol, runtime_checkable + +if TYPE_CHECKING: + from vero.core.evaluation import EvaluationParameters + from vero.workspace import Workspace + + +@runtime_checkable +class EvalStrategy(Protocol): + async def produce_sample_results( + self, + *, + workspace: Workspace, + params: EvaluationParameters, + result_dir: Path, + ) -> None: + """Run the evaluation for ``params.run`` (commit/split/sample_ids) against the + checked-out ``workspace`` and persist per-sample ``SampleResult``s to the result + store (so ``Evaluator`` can assemble them into an ``ExperimentResult``).""" + ... diff --git a/vero/src/vero/evaluator.py b/vero/src/vero/evaluator.py index 5b447bb..c5e9069 100644 --- a/vero/src/vero/evaluator.py +++ b/vero/src/vero/evaluator.py @@ -1,823 +1,15 @@ -from __future__ import annotations - -import json -import logging -import os -import random -import traceback -from pathlib import Path - -import yaml -from rich.panel import Panel -from rich.syntax import Syntax - -from .core.cli_adapters import UvRunParameters -from .core.constants import ( - evaluation_parameters_basename, - evaluation_results_basename, - pytest_report_basename, - result_metadata_basename, - samples_dir_name, -) -from .core.db.candidate import Candidate -from .core.db.database import Experiment, ExperimentDatabase -from .core.db.dataset import DatasetSubset -from .core.db.result import ExperimentResult, SampleResult -from .core.db.run import ExperimentRun -from .core.evaluation import BaseEvaluationParameters, EvaluationParameters -from .core.sessions import ( - clear_result_cache, - get_experiment_dir, - get_session_dir, - get_vero_home_dir, - initialize_result_store, - load_all_sample_results, - save_json_to_cache, +"""Back-compat shim. The implementation moved to ``vero.evaluation.evaluator``. 
+ +Prefer importing from ``vero.evaluation`` going forward; this module is kept so +existing ``from vero.evaluator import ...`` imports (examples, external code) keep +working. +""" + +from vero.evaluation.evaluator import ( # noqa: F401 + Evaluator, + _resolve_vero_dependency, + isolate_project, + run_evaluation, ) -from .core.task.utils import get_discover_cmd, get_run_cmd -from .exceptions import ExperimentRunFailedError -from .logging import setup_console -from .utils import run_subprocess_with_tee -from .workspace import Workspace -from .workspace.git import GitWorkspace - -console = setup_console() - -logger = logging.getLogger(__name__) - - -class Evaluator: - """Evaluates experiment runs by checking out commits and running tasks in subprocesses.""" - - def __init__( - self, - workspace: Workspace, - session_id: str, - *, - vero_home: Path | None = None, - use_copy: bool = False, - hooks: list[str] | None = None, - sync: bool = False, - subprocess_env_vars: list | Path | str | None = None, - task_project: Path | None = None, - task_module: str | None = None, - ): - self.workspace = workspace - self.session_id = session_id - self.vero_home = vero_home or get_vero_home_dir() - self.use_copy = use_copy - self.hooks = hooks if hooks is not None else ["setup_logging"] - self.sync = sync - self._subprocess_env_vars = subprocess_env_vars - self.task_project = task_project - self.task_module = task_module - self.on_experiment: list = [] # Callbacks fired after each evaluate() - - @property - def sessions_dir(self) -> Path: - return self.vero_home / "sessions" - - @property - def dataset_cache(self) -> Path: - return self.vero_home / "datasets" - - @property - def subprocess_env(self) -> dict[str, str] | None: - """Build subprocess env on demand from var names. 
Returns None to inherit os.environ.""" - if self._subprocess_env_vars is None: - return None - from vero.utils.subprocess_env import build_subprocess_env - - return build_subprocess_env(self._subprocess_env_vars) - - def _get_subprocess_env_with_vero_home(self) -> dict[str, str] | None: - """Build subprocess env and ensure VERO_HOME_DIR is set.""" - env = self.subprocess_env - if env is not None: - env["VERO_HOME_DIR"] = str(self.vero_home) - return env - - @staticmethod - def log_evaluation_results(result: ExperimentResult) -> None: - """Logs the evaluation results to the console.""" - stats = ( - result.sample_results_statistics( - as_dict=True, convert_lists_to_strings=True - ) - or {} - ) - if len(stats) > 0: - syntax = Syntax( - yaml.dump(stats, sort_keys=False), - "yaml", - theme="monokai", - line_numbers=False, - ) - console.print( - Panel( - syntax, - title="[bold green]⚙️ Evaluation Statistics[/bold green]", - border_style="green", - ) - ) - else: - console.print(f"No ExperimentResult found for run {result.run_id}.") - - def load_sample_results_from_cache( - self, evaluation_parameters: EvaluationParameters - ) -> dict[int, SampleResult]: - """Load the sample results from the cache. - - Tries to load from per-sample files first (new format), then falls back - to the single JSON file (legacy format) for backward compatibility. - """ - sample_results = load_all_sample_results( - self.sessions_dir, self.session_id, evaluation_parameters.result_id - ) - - if not sample_results: - logger.warning( - f"No sample results found for run {evaluation_parameters.run.id}." - ) - - return sample_results - - def _get_uv_params( - self, agent_project_path: Path | str - ) -> tuple[UvRunParameters, Path | str]: - """Build UvRunParameters and determine cwd for subprocess. - - When task_project is set, runs uv in the task project and layers - the agent code on top via --with-editable. Otherwise runs in the - agent project directly (backward compat). 
- - Returns: - (uv_params, cwd) tuple. - """ - if self.task_project: - return ( - UvRunParameters.from_env( - project=str(self.task_project), - with_editable=str(agent_project_path), - ), - self.task_project, - ) - return UvRunParameters.from_env( - project=str(agent_project_path) - ), agent_project_path - - async def _discover_tasks(self, project_path: Path | str) -> dict: - """Discover tasks via isolated subprocess. - - Args: - project_path: Path to the agent project. - - Returns: - Dictionary with package name and task metadata. - """ - uv_params, cwd = self._get_uv_params(project_path) - cmd = [*uv_params.get_cmd(), *get_discover_cmd(task_module=self.task_module)] - result = await run_subprocess_with_tee( - cmd, - timeout=60, - cwd=str(cwd), - flush=False, - tee_stdout=False, - env=self._get_subprocess_env_with_vero_home(), - ) - - if result.returncode != 0: - raise ExperimentRunFailedError( - f"Task discovery failed. Error: {result.stderr}.", - stdout=result.stdout, - stderr=result.stderr, - returncode=int(result.returncode), - ) - - return json.loads(result.stdout) - - async def _run_task( - self, - project_path: Path | str, - task_name: str, - params_file: Path, - timeout: int = 60 * 10, - ) -> dict | None: - """Execute task via isolated subprocess. - - Args: - project_path: Path to the user's project. - task_name: Name of the task to execute. - params_file: Path to JSON file containing EvaluationParameters. - timeout: Subprocess timeout in seconds. - - Returns: - Metrics dictionary from task execution, or None if parsing fails. 
- """ - uv_params, cwd = self._get_uv_params(project_path) - cmd = [ - *uv_params.get_cmd(), - *get_run_cmd( - task_name, params_file, hooks=self.hooks, task_module=self.task_module - ), - ] - result = await run_subprocess_with_tee( - cmd, - timeout=timeout, - cwd=cwd, - flush=True, - env=self._get_subprocess_env_with_vero_home(), - ) - logger.info("Subprocess complete!") - - # Save subprocess output for debugging - log_dir = params_file.parent - if result.stderr: - (log_dir / "subprocess_stderr.log").write_text(result.stderr) - if result.stdout: - (log_dir / "subprocess_stdout.log").write_text(result.stdout) - if result.returncode != 0: - (log_dir / "subprocess_returncode.txt").write_text(str(result.returncode)) - logger.warning( - f"Subprocess exited with code {result.returncode}. " - f"Stderr: {result.stderr[:500] if result.stderr else '(empty)'}" - ) - - # Read metrics from file (written by task subprocess) - metrics_path = log_dir / "metrics.json" - if metrics_path.exists(): - try: - return json.loads(metrics_path.read_text()) - except json.JSONDecodeError: - logger.warning(f"Failed to parse {metrics_path} as JSON") - return None - else: - logger.warning(f"Metrics file not found at {metrics_path}") - return None - - async def _run_task_in_subprocess( - self, - params: EvaluationParameters, - workspace: Workspace, - ) -> None: - """Run task via vero.task_utils subprocess. - - Args: - params: Evaluation parameters (must have task set). - workspace: Workspace to run in. - - Raises: - ExperimentRunFailedError: If task discovery or execution fails. - """ - - # Discover available tasks first - try: - discovery_result = await self._discover_tasks(workspace.project_path) - except Exception as e: - error_str = "".join(traceback.format_exception(type(e), e, e.__traceback__)) - raise ExperimentRunFailedError( - f"Task discovery failed. 
Error: {error_str}.", - stdout="", - stderr=error_str, - returncode=1, - ) - - # Validate the requested task exists - available_tasks = discovery_result.get("tasks", {}) - if params.task not in available_tasks: - available_names = list(available_tasks.keys()) - raise ExperimentRunFailedError( - f"Task '{params.task}' not found in package '{discovery_result.get('package', 'unknown')}'.\n" - f"Available tasks: {available_names if available_names else '(none found)'}\n" - f"Ensure your task is registered in vero_tasks/__init__.py", - stdout="", - stderr="", - returncode=1, - ) - - # Validate required environment variables - required_env = available_tasks[params.task].get("required_env_vars", []) - if required_env: - missing = [v for v in required_env if not os.environ.get(v)] - if missing: - raise ExperimentRunFailedError( - f"Task '{params.task}' requires environment variables that are not set: " - f"{', '.join(missing)}. Set them before running.", - stdout="", - stderr="", - returncode=1, - ) - - # Run the task - result_dir = get_experiment_dir( - self.sessions_dir, self.session_id, params.result_id - ) - params_file = result_dir / evaluation_parameters_basename - logger.info( - f"Running task '{params.task}' via vero.task_utils in {workspace.project_path}" - ) - try: - metrics = await self._run_task( - workspace.project_path, - params.task, - params_file, - timeout=params.timeout, - ) - logger.info(f"Task completed with metrics: {metrics}") - except Exception as e: - error_str = "".join(traceback.format_exception(type(e), e, e.__traceback__)) - raise ExperimentRunFailedError( - f"Task execution failed. Error: {error_str}.", - stdout="", - stderr=error_str, - returncode=1, - ) - - async def run( - self, - evaluation_parameters: EvaluationParameters, - use_copy: bool | None = None, - ) -> ExperimentResult: - """Run an experiment by checking out the candidate commit and running tasks via uv. - - Args: - evaluation_parameters: The parameters for the evaluation. 
- use_copy: Override for self.use_copy. If True, creates a temporary isolated copy - of the workspace (always clean). If False, uses the current workspace (requires clean state). - - Returns: - ExperimentResult with sample results and metadata. - """ - use_copy = use_copy if use_copy is not None else self.use_copy - - if not use_copy: - return await self._run_in_workspace(evaluation_parameters, self.workspace) - - async with self.workspace.temp_copy( - from_version=evaluation_parameters.run.candidate.commit, - ) as temp_workspace: - return await self._run_in_workspace(evaluation_parameters, temp_workspace) - - async def evaluate( - self, - commit: str, - dataset_id: str, - split: str, - task: str | None = None, - sample_ids: list[int] | None = None, - db: ExperimentDatabase | None = None, - evaluation_parameters: BaseEvaluationParameters | None = None, - use_copy: bool | None = None, - ) -> Experiment: - """Full evaluation lifecycle: resolve commit → run → create experiment → DB → hooks. - - This is the single entry point for all evaluations. Both Policy.evaluate_commit() - and ExperimentRunnerTool delegate here. - - Args: - commit: Git commit hash or ref to evaluate. - dataset_id: Dataset ID in the session store. - split: Dataset split to evaluate. - task: Task name to execute. - sample_ids: Specific sample IDs to evaluate (None = all). - db: ExperimentDatabase to record the experiment in. - evaluation_parameters: Base eval params (timeout, concurrency, etc.). - use_copy: Whether to create a temporary copy for the eval. - - Returns: - The completed Experiment with results. - """ - from .core.db.database import Experiment - - # Resolve commit ref to canonical version ID - try: - if isinstance(self.workspace, GitWorkspace): - full_hash = await self.workspace.resolve_ref(commit) - else: - full_hash = commit - except Exception as e: - raise ValueError( - f"Cannot resolve commit '{commit}': {e}. " - f"Make sure the commit exists in the repository." 
- ) - - # Build candidate - candidate = None - if db is not None: - candidate = db.get_candidate((self.workspace.name, full_hash)) - if candidate is None: - candidate = Candidate(commit=full_hash, repo_name=self.workspace.name) - - # Build run - dataset_subset = DatasetSubset( - split=split, sample_ids=sample_ids, dataset_id=dataset_id - ) - run = ExperimentRun(candidate=candidate, dataset_subset=dataset_subset) - - # Build eval params - base_params = evaluation_parameters or BaseEvaluationParameters() - params = EvaluationParameters( - **base_params.model_dump(), - run=run, - dataset_id=dataset_id, - task=task, - session_id=self.session_id, - ) - - # Run - result = await self.run(params, use_copy=use_copy) - - # Create experiment - experiment = Experiment(run=run, result=result) - - # Add to DB - if db is not None: - db.add_experiment(experiment) - - # Fire post-eval callbacks (may be sync or async) - import asyncio as _asyncio - - for callback in self.on_experiment: - try: - result = callback(experiment) - if _asyncio.iscoroutine(result): - await result - except Exception as e: - logger.warning(f"on_experiment callback failed: {e}") - - return experiment - - async def _run_in_workspace( - self, params: EvaluationParameters, workspace: Workspace - ) -> ExperimentResult: - """Run an experiment by checking out the candidate commit and running tasks via uv.""" - - # We cannot execute with a dirty workspace, as this may introduce side effects on the evaluation results. - if await workspace.is_dirty(): - raise RuntimeError( - "Evaluator cannot execute. There are unsaved changes in the workspace." 
- ) - - # Update the evaluation parameters with the dataset loader and session_id - params.session_id = self.session_id - - # Initialize the directory to store the evaluation and pytest report files - result_dir = initialize_result_store( - self.sessions_dir, self.session_id, params.result_id - ) - - save_json_to_cache( - self.sessions_dir, - self.session_id, - params.result_id, - basename=evaluation_parameters_basename, - data=params, - ) - logger.info( - f"Saved evaluation parameters to cache: {result_dir / evaluation_parameters_basename}" - ) - - # Git-specific: fetch from remote if configured - if self.sync and isinstance(workspace, GitWorkspace): - await workspace.maybe_fetch() - - # Clear any stale cached results before running to avoid reading old data if run fails - clear_result_cache( - self.sessions_dir, - self.session_id, - params.result_id, - result_basenames=[pytest_report_basename, evaluation_results_basename], - ) - - # Transfer data into the sandbox before running - experiment_dir = str( - get_experiment_dir(self.sessions_dir, self.session_id, params.result_id) - ) - await workspace.sandbox.upload(experiment_dir, experiment_dir) - - # Upload dataset cache so subprocess can load it - from vero.core.dataset.store import _read_mapping - - mapping = _read_mapping(self.sessions_dir, self.session_id) - dataset_fp = mapping.get(params.dataset_id or "") - if dataset_fp: - cache_entry = str(self.dataset_cache / dataset_fp) - await workspace.sandbox.upload(cache_entry, cache_entry) - # Also upload the session datasets.json mapping - session_dir = str(get_session_dir(self.sessions_dir, self.session_id)) - datasets_json = f"{session_dir}/datasets.json" - await workspace.sandbox.upload(datasets_json, datasets_json) - - # Switch to the candidate version and run the evaluation in a subprocess - async with workspace.at(params.run.candidate.commit): - await self._run_task_in_subprocess(params, workspace) - - # Transfer results back from the sandbox - await 
workspace.sandbox.download(experiment_dir, experiment_dir) - - sample_results = self.load_sample_results_from_cache(params) - - if not sample_results: - raise ExperimentRunFailedError( - f"No sample results found for run {params.run.id}! Likely because execution failed.", - returncode=1, - ) - else: - result = ExperimentResult.create_with_status( - id=params.result_id, - error_rate=params.error_rate_threshold, - run_id=params.run.id, - sample_results=sample_results, - ) - - # Write result metadata to disk so the DB can be reconstructed from experiments/ - save_json_to_cache( - self.sessions_dir, - self.session_id, - params.result_id, - basename=result_metadata_basename, - data={ - "id": result.id, - "run_id": result.run_id, - "status": result.status.value, - }, - ) - - self.log_evaluation_results(result) - return result - - -def _resolve_vero_dependency(isolated_dir: Path, original_project_dir: Path) -> None: - """Resolve the vero path dependency in pyproject.toml after isolation. - - When a project is isolated (copied to a new location), relative path - dependencies in ``[tool.uv.sources]`` break. This function resolves - the ``scale-vero`` dependency to an absolute path via ``uv add``. - - Raises ValueError if any *other* relative path dependencies are found, - since those are unsupported and would silently break. 
- """ - import subprocess - import tomllib - - pyproject_path = isolated_dir / "pyproject.toml" - if not pyproject_path.exists(): - return - - with open(pyproject_path, "rb") as f: - pyproject = tomllib.load(f) - - sources = pyproject.get("tool", {}).get("uv", {}).get("sources", {}) - if not sources: - return - - for name, source in sources.items(): - if not isinstance(source, dict) or "path" not in source: - continue - - rel_path = source["path"] - if not rel_path.startswith(".") and not rel_path.startswith("/"): - continue # Not a relative path - - if "vero" in name.lower(): - # Always resolve to the known vero package directory rather than - # trusting the relative path (which may be stale or wrong). - from vero.core.constants import PACKAGE_DIR - - abs_path = PACKAGE_DIR - editable_flag = ["--editable"] if source.get("editable") else [] - subprocess.run( - ["uv", "add", *editable_flag, "--dev", str(abs_path)], - cwd=isolated_dir, - capture_output=True, - check=True, - ) - logger.info(f"Resolved {name} dependency: {rel_path} -> {abs_path}") - else: - raise ValueError( - f"Unsupported relative path dependency '{name}' " - f"(path={rel_path!r}) in {pyproject_path}. " - f"Only vero is handled during isolation." - ) - - -def isolate_project( - project_path: Path | str, - session_id: str, - git_ref: str = "HEAD", - *, - sessions_dir: Path, -) -> Path: - """Copy a project into a fresh, standalone git repo. - - Useful when the project lives inside a monorepo or has uncommitted changes. - Extracts files at *git_ref* via ``git archive`` (falling back to a plain - copy when the source is not a git repo), then ``git init`` + ``git commit`` - so the result is a clean, self-contained repository. - - Relative path dependencies on vero in pyproject.toml are resolved to - absolute paths so they remain valid after the copy. - - Args: - project_path: Path to the project directory. - session_id: Session ID (isolated copy is placed under the session dir). 
- git_ref: Git ref to archive from (default: HEAD). - sessions_dir: Path to the sessions root directory. - - Returns: - Path to the isolated project root. - """ - import shutil - import subprocess - - project_path = Path(project_path).resolve() - isolated_dir = (sessions_dir / session_id) / project_path.name - isolated_dir.mkdir(parents=True, exist_ok=True) - - repo_root_result = subprocess.run( - ["git", "rev-parse", "--show-toplevel"], - cwd=project_path, - capture_output=True, - text=True, - ) - - if repo_root_result.returncode == 0: - repo_root_path = Path(repo_root_result.stdout.strip()) - project_rel = project_path.relative_to(repo_root_path) - strip = len(project_rel.parts) - - archive = subprocess.Popen( - ["git", "archive", git_ref, str(project_rel)], - cwd=repo_root_path, - stdout=subprocess.PIPE, - ) - subprocess.run( - ["tar", "xf", "-", "--strip-components", str(strip)], - cwd=isolated_dir, - stdin=archive.stdout, - check=True, - ) - archive.wait() - else: - shutil.copytree(project_path, isolated_dir, dirs_exist_ok=True) - - # Resolve vero dependency before git init (so it's in the initial commit) - _resolve_vero_dependency(isolated_dir, project_path) - - subprocess.run(["git", "init"], cwd=isolated_dir, capture_output=True, check=True) - subprocess.run( - ["git", "add", "."], cwd=isolated_dir, capture_output=True, check=True - ) - subprocess.run( - [ - "git", - "-c", - "user.name=vero", - "-c", - "user.email=vero@localhost", - "commit", - "-m", - "Initial commit (isolated)", - ], - cwd=isolated_dir, - capture_output=True, - check=True, - ) - - if repo_root_result.returncode == 0: - subprocess.run( - ["git", "remote", "add", "origin", repo_root_result.stdout.strip()], - cwd=isolated_dir, - capture_output=True, - ) - - logger.info(f"Isolated project: {project_path} -> {isolated_dir}") - return isolated_dir - - -async def run_evaluation( - project_path: Path | str, - dataset: str | Path, - split: str, - task: str | None = None, - commit: str | None = 
None, - sample_ids: list[int] | None = None, - num_samples: int | None = None, - task_params: dict | None = None, - seed: int = 42, - timeout: int = 3600, - per_sample_timeout: int = 180, - create_temporary_copy: bool = False, - isolate: bool = False, - hooks: list[str] | None = None, - session_id: str | None = None, - max_concurrency: int | None = None, - subprocess_env_vars: list[str] | Path | str | None = None, - task_project: Path | str | None = None, - task_module: str | None = None, - vero_home: Path | None = None, -) -> ExperimentResult: - """Run an evaluation using the given parameters. - - Args: - project_path: Path to the agent project to evaluate. - dataset: Dataset, DatasetDict, path to saved dataset, or dataset ID string. - split: Dataset split to evaluate. - task: Task name to execute from vero_tasks module. - commit: Commit to evaluate. - sample_ids: List of sample IDs to evaluate. - num_samples: Number of samples to evaluate. - task_params: Task-specific parameters for the evaluation. - seed: Random seed for sample selection. - timeout: Overall timeout for the evaluation subprocess in seconds. - per_sample_timeout: Timeout for a single sample in seconds. - create_temporary_copy: Whether to create a temporary copy for the evaluation. - isolate: Whether to copy the project into a fresh git repo before evaluating. - hooks: List of hook names to execute before task. - session_id: Session ID. - max_concurrency: Maximum concurrent tasks. - subprocess_env_vars: Environment variable names to pass to task subprocesses. - task_project: Path to a separate task project. When set, evaluator runs - uv in the task project and layers the agent via --with-editable. - task_module: Explicit Python module to import for task registration - (e.g. "my_eval_tasks.vero_tasks"). If None, auto-discovers. - vero_home: Path to the vero home directory. Defaults to ~/.vero. - - Returns: - The experiment result. - - Raises: - ExperimentRunFailedError: If the evaluation fails. 
- """ - from vero.core.dataset.store import resolve_and_save_dataset - - vh = vero_home or get_vero_home_dir() - sessions_dir = vh / "sessions" - dataset_cache = vh / "datasets" - - if task_params is None: - task_params = {} - - if session_id is None: - from uuid import uuid4 - - session_id = str(uuid4()) - logger.info(f"Auto-generated session ID: {session_id}") - - if isolate: - project_path = isolate_project( - project_path, session_id, sessions_dir=sessions_dir - ) - - workspace = await GitWorkspace.create(project_path) - - # Resolve and save dataset - dataset_id = resolve_and_save_dataset( - dataset, sessions_dir, dataset_cache, session_id - ) - - evaluator = Evaluator( - workspace=workspace, - use_copy=create_temporary_copy, - hooks=hooks, - session_id=session_id, - vero_home=vh, - subprocess_env_vars=subprocess_env_vars, - task_project=Path(task_project) if task_project else None, - task_module=task_module, - ) - - if commit is None: - commit = await workspace.current_version() - logger.warning(f"No commit provided, using current commit: {commit}.") - - # Sample data if num_samples is provided - if num_samples is not None and sample_ids is None: - from vero.core.dataset.store import load_dataset as _load_ds - - ds = _load_ds(sessions_dir, dataset_cache, session_id, dataset_id) - rng = random.Random(seed) - sample_ids = rng.sample(range(len(ds[split])), num_samples) - - # Build base eval params - eval_params = BaseEvaluationParameters( - timeout=timeout, - sample_timeout=per_sample_timeout, - task_params=task_params, - ) - if max_concurrency is not None: - eval_params.max_concurrency = max_concurrency - - experiment = await evaluator.evaluate( - commit=commit, - dataset_id=dataset_id, - split=split, - task=task, - sample_ids=sample_ids, - evaluation_parameters=eval_params, - use_copy=create_temporary_copy, - ) - result_dir = get_experiment_dir(sessions_dir, session_id, experiment.id) - console.print(f"Result available at {result_dir / samples_dir_name}") - 
return experiment.result +__all__ = ["Evaluator", "isolate_project", "run_evaluation", "_resolve_vero_dependency"] diff --git a/vero/src/vero/harbor/__init__.py b/vero/src/vero/harbor/__init__.py new file mode 100644 index 0000000..43df555 --- /dev/null +++ b/vero/src/vero/harbor/__init__.py @@ -0,0 +1,19 @@ +"""Harbor integration: the sidecar-specific frontend over the shared +EvaluationEngine, plus Mode B (Harbor-delegated eval). The `harbor` SDK is an +optional extra, imported lazily (only registry enumeration / nested runs need it — +config, dataset compilation, and the sidecar handlers do not). +""" + +from vero.harbor.config import HarborConfig +from vero.harbor.dataset import ( + build_harbor_dataset, + enumerate_local_task_names, + validate_partition, +) + +__all__ = [ + "HarborConfig", + "build_harbor_dataset", + "enumerate_local_task_names", + "validate_partition", +] diff --git a/vero/src/vero/harbor/app.py b/vero/src/vero/harbor/app.py new file mode 100644 index 0000000..e98bcce --- /dev/null +++ b/vero/src/vero/harbor/app.py @@ -0,0 +1,91 @@ +"""FastAPI app for the eval sidecar — the HTTP surface over the (transport-agnostic) +EvaluationSidecar handlers + the admin `finalize` over the Verifier. + +Two roles over one app: agent (`/eval`, `/submit`, `/status`; unauthenticated, metered, +redacted) and admin (`/finalize`; bearer-token gated). `vero harbor serve` runs +this under uvicorn in the eval-sidecar container. 
+""" + +from __future__ import annotations + +from typing import TYPE_CHECKING + +from fastapi import FastAPI, Header, HTTPException +from fastapi.responses import JSONResponse +from pydantic import BaseModel + +from vero.evaluation.engine import EvalRequest +from vero.exceptions import ExperimentBudgetExceeded, InvalidSplitError +from vero.harbor.auth import check_admin +from vero.harbor.server import SubmitDisabledError +from vero.harbor.verifier import NoCandidateError + +if TYPE_CHECKING: + from vero.harbor.server import EvaluationSidecar + from vero.harbor.verifier import Verifier + + +class EvalBody(BaseModel): + dataset_id: str + split: str + commit: str | None = None + sample_ids: list[int] | None = None + num_samples: int | None = None + + +class SubmitBody(BaseModel): + commit: str | None = None + + +def create_app( + *, + sidecar: EvaluationSidecar, + verifier: Verifier, + admin_token: str, +) -> FastAPI: + app = FastAPI(title="vero eval sidecar") + + # Known errors -> agent-facing status codes. 
+ app.add_exception_handler( + ExperimentBudgetExceeded, + lambda r, e: JSONResponse(status_code=429, content={"error": str(e)}), + ) + app.add_exception_handler( + InvalidSplitError, + lambda r, e: JSONResponse(status_code=400, content={"error": str(e)}), + ) + app.add_exception_handler( + SubmitDisabledError, + lambda r, e: JSONResponse(status_code=409, content={"error": str(e)}), + ) + app.add_exception_handler( + NoCandidateError, + lambda r, e: JSONResponse(status_code=409, content={"error": str(e)}), + ) + + @app.get("/health") + async def health(): + return {"ok": True} + + # --- agent endpoints (unauthenticated; metered + redacted) --- + @app.post("/eval") + async def eval_(body: EvalBody): + summary = await sidecar.evaluate(EvalRequest(**body.model_dump()), admin=False) + return summary.to_dict() + + @app.post("/submit") + async def submit(body: SubmitBody): + return await sidecar.submit(commit=body.commit) + + @app.get("/status") + async def status(): + return sidecar.status().to_dict() + + # --- admin endpoint (bearer-token gated) --- + @app.post("/finalize") + async def finalize(authorization: str | None = Header(default=None)): + if not check_admin(authorization, admin_token): + raise HTTPException(status_code=403, detail="admin token required") + return await verifier.finalize() + + return app diff --git a/vero/src/vero/harbor/auth.py b/vero/src/vero/harbor/auth.py new file mode 100644 index 0000000..8fafad9 --- /dev/null +++ b/vero/src/vero/harbor/auth.py @@ -0,0 +1,39 @@ +"""Admin-token auth for the eval sidecar. + +The token gates the admin `finalize` endpoint. It is generated per trial by the +sidecar and written `root:600` on a volume mounted into `main`, so the verifier +(root, shared mode) can read it but the optimizer (`agent.user`) cannot. The +optimizer therefore can only reach the agent endpoints, never `finalize`. 
+""" + +from __future__ import annotations + +import secrets +from pathlib import Path + +_BEARER = "Bearer " + + +def generate_token() -> str: + return secrets.token_urlsafe(32) + + +def write_admin_token(path: Path | str, token: str, *, mode: int = 0o600) -> Path: + """Write the token to ``path`` with restrictive perms (caller runs as root so the + file is root-owned and unreadable by ``agent.user``).""" + p = Path(path) + p.parent.mkdir(parents=True, exist_ok=True) + p.write_text(token) + p.chmod(mode) + return p + + +def read_admin_token(path: Path | str) -> str: + return Path(path).read_text().strip() + + +def check_admin(authorization: str | None, expected_token: str) -> bool: + """Constant-time check of an ``Authorization: Bearer <token>`` header.""" + if not authorization or not authorization.startswith(_BEARER): + return False + return secrets.compare_digest(authorization[len(_BEARER):], expected_token) diff --git a/vero/src/vero/harbor/build/__init__.py b/vero/src/vero/harbor/build/__init__.py new file mode 100644 index 0000000..17711fd --- /dev/null +++ b/vero/src/vero/harbor/build/__init__.py @@ -0,0 +1,6 @@ +"""The `vero harbor build` compiler: BuildConfig -> a runnable Harbor task dir.""" + +from vero.harbor.build.compiler import compile_task +from vero.harbor.build.config import BuildConfig + +__all__ = ["BuildConfig", "compile_task"] diff --git a/vero/src/vero/harbor/build/compiler.py b/vero/src/vero/harbor/build/compiler.py new file mode 100644 index 0000000..6151ca0 --- /dev/null +++ b/vero/src/vero/harbor/build/compiler.py @@ -0,0 +1,265 @@ +"""The `vero harbor build` compiler: BuildConfig -> a runnable Harbor task dir. + +Emits the environment (optimizer workbench `main` + eval `eval-sidecar`), the +protocol (instruction.md), the verifier (tests/test.sh -> `vero harbor finalize`), +and bakes the ServeConfig + dataset + baseline repo + vero source. The result runs +with `harbor run -p -a -m -e docker`.
+""" + +from __future__ import annotations + +import logging +import re +import shutil +import subprocess +from pathlib import Path + +from jinja2 import Environment, FileSystemLoader + +from vero.harbor.build.config import BuildConfig + +logger = logging.getLogger(__name__) + +_TEMPLATES = Path(__file__).parent / "templates" + +# Container paths (must match the templates). +VERO_DIR = "/opt/vero" +AGENT_BASELINE = "/opt/agent-baseline" # sidecar engine workspace +WORK_AGENT = "/work/agent" # shared agent repo (main rw, sidecar ro) +VERO_HOME = "/opt/vero_home" +INNER_TASK = "/opt/inner-task" # Mode B: baked inner Harbor task (the protected benchmark) +SERVE_JSON = "/opt/serve.json" +ADMIN_VOLUME = "/state/admin" +AGENT_VOLUME = "/state/agent-results" +TOKEN_PATH = "/state/token/admin.token" +SESSION_ID = "trial" + +# vero source items copied into the build context (enough to `uv pip install`). +_VERO_COPY = ["pyproject.toml", "README.md", "uv.lock", "src"] + + +def _render(env: Environment, template_name: str, dest: Path, **ctx) -> None: + dest.parent.mkdir(parents=True, exist_ok=True) + dest.write_text(env.get_template(template_name).render(**ctx)) + + +def _copy_vero_source(vero_root: Path, dest: Path) -> None: + dest.mkdir(parents=True, exist_ok=True) + for item in _VERO_COPY: + src = vero_root / item + if not src.exists(): + continue + if src.is_dir(): + shutil.copytree(src, dest / item, dirs_exist_ok=True) + else: + shutil.copy2(src, dest / item) + + +def _rewrite_vero_source_path(pyproject: Path) -> None: + """Point a relative `scale-vero` path dependency at the baked /opt/vero so it + resolves regardless of where the repo (or a temp worktree of it) lives.""" + if not pyproject.exists(): + return + text = pyproject.read_text() + new = re.sub( + r'(scale-vero\s*=\s*\{[^}]*?path\s*=\s*")[^"]*(")', + rf"\g<1>{VERO_DIR}\g<2>", + text, + ) + if new != text: + pyproject.write_text(new) + logger.info("Rewrote scale-vero source path -> %s", VERO_DIR) + + +def 
_prepare_baseline_repo(agent_repo: Path, dest: Path) -> str: + """Materialize the target repo at HEAD into a clean standalone git repo + (vero path rewritten) and return its commit sha. Copied verbatim (incl. .git) + into both the sidecar (engine workspace) and main (seed), so they share a sha.""" + dest.mkdir(parents=True, exist_ok=True) + toplevel = subprocess.run( + ["git", "-C", str(agent_repo), "rev-parse", "--show-toplevel"], + capture_output=True, text=True, + ) + if toplevel.returncode == 0: + # Extract only the target subtree at HEAD (the repo may be a monorepo and + # agent_repo a subdirectory of it), stripping the leading path components. + repo_root = Path(toplevel.stdout.strip()) + rel = agent_repo.relative_to(repo_root) + strip = len(rel.parts) + archive = subprocess.Popen( + ["git", "-C", str(repo_root), "archive", "HEAD", str(rel)] + if strip else ["git", "-C", str(repo_root), "archive", "HEAD"], + stdout=subprocess.PIPE, + ) + subprocess.run( + ["tar", "xf", "-", "--strip-components", str(strip)], + cwd=dest, stdin=archive.stdout, check=True, + ) + archive.wait() + else: + shutil.copytree(agent_repo, dest, dirs_exist_ok=True) + + _rewrite_vero_source_path(dest / "pyproject.toml") + + def git(*args: str) -> str: + return subprocess.run( + ["git", "-c", "user.name=vero", "-c", "user.email=vero@localhost", + "-C", str(dest), *args], + capture_output=True, text=True, check=True, + ).stdout.strip() + + git("init", "-q") + git("add", "-A") + git("commit", "-q", "-m", "baseline") + return git("rev-parse", "HEAD") + + +def _register(dataset, vero_home: Path, tmp: Path) -> str: + """Register a dataset (path/DatasetDict) into a baked VERO_HOME; return dataset_id.""" + from vero.core.dataset.store import resolve_and_save_dataset + + sessions = vero_home / "sessions" + datasets = vero_home / "datasets" + (sessions / SESSION_ID).mkdir(parents=True, exist_ok=True) + datasets.mkdir(parents=True, exist_ok=True) + if not isinstance(dataset, str): # a DatasetDict -> 
save_to_disk first + path = tmp / "ds" + dataset.save_to_disk(str(path)) + dataset = str(path) + return resolve_and_save_dataset(dataset, sessions, datasets, SESSION_ID) + + +def _serve_config(config: BuildConfig, dataset_id: str | None, base_commit: str) -> dict: + harbor = None + if config.harbor is not None: + # Local inner task -> baked sidecar-only path; registry ref -> pass through. + harbor = {**config.harbor} + if config.inner_task: + harbor["task_source"] = INNER_TASK + targets = [ + { + "task": config.task, + "dataset_id": dataset_id, + "split": t.split, + "reward_key": t.reward_key, + "sample_ids": t.sample_ids, + } + for t in config.targets + ] + return { + "repo_path": AGENT_BASELINE, + "agent_repo_path": WORK_AGENT, + "session_id": SESSION_ID, + "dataset_id": dataset_id, + "split_accesses": [s.model_dump() for s in config.splits], + "budgets": [ + {"split": b.split, "dataset_id": dataset_id, **b.model_dump(exclude={"split"}, exclude_none=True)} + for b in config.budgets + ], + "task": config.task, + "task_project": config.task_project, + "task_module": config.task_module, + "harbor": harbor, + "reward_mode": config.reward_mode, + "selection_split": config.selection_split, + "targets": targets, + "base_commit": base_commit, + "submit_enabled": config.submit_enabled, + "agent_volume": AGENT_VOLUME, + "admin_volume": ADMIN_VOLUME, + "admin_token_path": TOKEN_PATH, + "timeout": config.timeout, + "sample_timeout": config.sample_timeout, + "max_concurrency": config.max_concurrency, + "host": "0.0.0.0", + "port": 8000, + } + + +def compile_task( + config: BuildConfig, out_dir: Path | str, *, vero_root: Path | None = None +) -> Path: + """Compile ``config`` into a Harbor task directory at ``out_dir``.""" + import json + + from vero.core.constants import PACKAGE_DIR + + vero_root = vero_root or PACKAGE_DIR + out = Path(out_dir) + if out.exists(): + shutil.rmtree(out) + env_dir = out / "environment" + env_dir.mkdir(parents=True) + + agent_repo = 
Path(config.agent_repo).resolve() + + # 1. vero source (both images install from here) + _copy_vero_source(vero_root, env_dir / "vero") + + # 2. baseline repo -> sidecar engine workspace + main seed (shared sha) + base_commit = _prepare_baseline_repo(agent_repo, env_dir / "agent-baseline") + shutil.copytree(env_dir / "agent-baseline", env_dir / "agent-seed") + + # 3. dataset -> baked VERO_HOME. Mode A: input+label rows. Mode B: the + # {split: [task_names]} partition + the inner Harbor task baked sidecar-only. + import tempfile + + vh = env_dir / "sidecar" / "vero_home" + tmp = Path(tempfile.mkdtemp()) + if config.mode == "A": + if not config.dataset: + raise ValueError("Mode A requires a dataset.") + dataset_id = _register(config.dataset, vh, tmp) + else: + if not (config.partition and config.harbor): + raise ValueError("Mode B requires partition + harbor.") + if not (config.inner_task or config.harbor.get("task_source")): + raise ValueError("Mode B requires inner_task (local) or harbor.task_source (registry).") + from vero.harbor.dataset import build_harbor_dataset + + dataset_id = _register(build_harbor_dataset(config.partition), vh, tmp) + if config.inner_task: # local benchmark -> bake sidecar-only + shutil.copytree(Path(config.inner_task).resolve(), env_dir / "sidecar" / "inner-task") + + # 4. ServeConfig (compiler <-> serve contract) + (env_dir / "sidecar" / "serve.json").write_text( + json.dumps(_serve_config(config, dataset_id, base_commit), indent=2) + ) + + # 5. 
render templates + jenv = Environment( + loader=FileSystemLoader(str(_TEMPLATES)), + keep_trailing_newline=True, + trim_blocks=True, + lstrip_blocks=True, + ) + ctx = dict( + name=config.name, + description=config.description, + mode=config.mode, + timeout=config.timeout, + secrets=config.secrets, + read_only_paths=config.read_only_paths, + base_image_main=config.base_image_main, + base_image_sidecar=config.base_image_sidecar, + dataset_id=dataset_id, + selection_split=config.selection_split, + submit_enabled=config.submit_enabled, + eval_num_samples=None, + bake_inner_task=bool(config.inner_task), + ) + _render(jenv, "task.toml.j2", out / "task.toml", **ctx) + _render(jenv, "instruction.md.j2", out / "instruction.md", **ctx) + _render(jenv, "docker-compose.yaml.j2", env_dir / "docker-compose.yaml", **ctx) + _render(jenv, "Dockerfile.main.j2", env_dir / "Dockerfile", **ctx) + _render(jenv, "Dockerfile.sidecar.j2", env_dir / "sidecar" / "Dockerfile", **ctx) + _render(jenv, "seed.sh.j2", env_dir / "main" / "seed.sh", **ctx) + _render(jenv, "test.sh.j2", out / "tests" / "test.sh", **ctx) + _render(jenv, "solve.sh.j2", out / "solution" / "solve.sh", **ctx) + + for script in [out / "tests" / "test.sh", out / "solution" / "solve.sh", + env_dir / "main" / "seed.sh"]: + script.chmod(0o755) + + logger.info("Compiled Harbor task -> %s (baseline %s)", out, base_commit[:12]) + return out diff --git a/vero/src/vero/harbor/build/config.py b/vero/src/vero/harbor/build/config.py new file mode 100644 index 0000000..7be37b7 --- /dev/null +++ b/vero/src/vero/harbor/build/config.py @@ -0,0 +1,97 @@ +"""`BuildConfig` — the `vero harbor build -c build.yaml` schema. + +Everything the compiler needs to emit a Harbor optimization task. Mode A (vero +runs inference + scoring) and Mode B (nested `harbor run`) share one topology; +the differences are which extras the sidecar bakes and which secrets it needs. 
+""" + +from __future__ import annotations + +from pathlib import Path +from typing import Literal + +import yaml +from pydantic import BaseModel, Field + + +class SplitAccessSpec(BaseModel): + split: str + access: Literal["viewable", "non_viewable", "no_access"] + + +class BudgetSpec(BaseModel): + split: str + total_run_budget: int | None = None + total_sample_budget: int | None = None + + +class TargetSpec(BaseModel): + """A scoring target the verifier evaluates the selected commit on.""" + + split: str + reward_key: str = "reward" + sample_ids: list[int] | None = None + + +class BuildConfig(BaseModel): + """Inputs to `vero harbor build`.""" + + # identity + name: str = Field(description="Harbor task name, 'org/name' format.") + description: str = "" + + # the target repo the optimizer edits (baseline in main + sidecar) + agent_repo: str + + # mode A (scoring in vero): task name + dataset (+ optional separate task project) + mode: Literal["A", "B"] = "A" + task: str | None = None + task_project: str | None = None + task_module: str | None = None + dataset: str | None = Field( + default=None, description="Path to a saved DatasetDict (Mode A)." + ) + + # mode B (scoring in nested harbor): HarborConfig kwargs (task_source filled by the + # compiler from inner_task), the {split: [task_names]} partition, and the inner + # Harbor task dir baked sidecar-only (the protected benchmark, mirrors Mode A's dataset). + harbor: dict | None = None + partition: dict[str, list[str]] | None = None + inner_task: str | None = None + + # tiers / budget / reward + splits: list[SplitAccessSpec] + budgets: list[BudgetSpec] = Field(default_factory=list) + reward_mode: Literal["submit", "auto_best"] = "auto_best" + selection_split: str = "validation" + targets: list[TargetSpec] = Field(default_factory=list) + submit_enabled: bool = False + + # write-access: paths in the target repo the optimizer may NOT edit + # (the scorer, by default). Applied as unix perms in main before the agent runs. 
+ read_only_paths: list[str] = Field(default_factory=list) + + # secrets resolved from the host and injected into the SIDECAR only + secrets: list[str] = Field(default_factory=lambda: ["OPENAI_API_KEY"]) + + # image bases + base_image_main: str = "ghcr.io/astral-sh/uv:python3.12-bookworm" + base_image_sidecar: str = "ghcr.io/astral-sh/uv:python3.12-bookworm" + + # eval params baked into the ServeConfig + timeout: int = 1800 + sample_timeout: int = 300 + max_concurrency: int = 8 + + @classmethod + def from_file(cls, path: Path | str) -> BuildConfig: + path = Path(path).resolve() + data = yaml.safe_load(path.read_text()) + # Resolve relative local-path fields against the build.yaml's directory, so a + # config is portable regardless of the working directory it's built from. + base = path.parent + for field in ("agent_repo", "dataset", "inner_task"): + val = data.get(field) + if isinstance(val, str) and not Path(val).is_absolute(): + data[field] = str((base / val).resolve()) + return cls.model_validate(data) diff --git a/vero/src/vero/harbor/build/templates/Dockerfile.main.j2 b/vero/src/vero/harbor/build/templates/Dockerfile.main.j2 new file mode 100644 index 0000000..0861553 --- /dev/null +++ b/vero/src/vero/harbor/build/templates/Dockerfile.main.j2 @@ -0,0 +1,20 @@ +# main: the optimizer's workbench. Harbor installs the `-a` optimizer agent here +# and runs it against instruction.md. Holds the target repo (rw, minus locked +# paths) + the `vero` CLI client. Runs the container as root (for seed + verifier); +# the optimizer is exec'd as the de-privileged `agent` user. 
+FROM {{ base_image_main }} + +RUN apt-get update \ + && apt-get install -y --no-install-recommends git ca-certificates curl \ + && rm -rf /var/lib/apt/lists/* + +# vero + CLI client (eval / submit / status / finalize over VERO_EVAL_URL) +COPY vero /opt/vero +RUN uv pip install --system "/opt/vero[harbor]" + +# baseline target repo (seeded onto the shared volume at start) + the seed script +COPY agent-seed /opt/agent-seed +COPY main/seed.sh /opt/seed.sh +RUN chmod +x /opt/seed.sh && useradd -m -u 1001 agent + +WORKDIR /work/agent diff --git a/vero/src/vero/harbor/build/templates/Dockerfile.sidecar.j2 b/vero/src/vero/harbor/build/templates/Dockerfile.sidecar.j2 new file mode 100644 index 0000000..7eea688 --- /dev/null +++ b/vero/src/vero/harbor/build/templates/Dockerfile.sidecar.j2 @@ -0,0 +1,29 @@ +# eval-sidecar: the evaluation engine. Holds the dataset + scoring + baseline repo +# + ledger + creds. Runs `vero harbor serve` (HTTP). Secrets reach this container +# only (compose); the admin volume is never mounted to main. +FROM {{ base_image_sidecar }} + +RUN apt-get update \ + && apt-get install -y --no-install-recommends git ca-certificates \ + && rm -rf /var/lib/apt/lists/* + +COPY vero /opt/vero +RUN uv pip install --system "/opt/vero[harbor]" + +# baseline repo = the engine's GitWorkspace (fetches the optimizer's commits from +# the ro-mounted /work/agent); baked vero_home (registered dataset{% if mode == 'A' %} + scoring{% endif %}). 
+COPY agent-baseline /opt/agent-baseline +COPY sidecar/vero_home /opt/vero_home +COPY sidecar/serve.json /opt/serve.json +{% if bake_inner_task %} +# inner Harbor task (the protected benchmark the candidate agent is run against) +COPY sidecar/inner-task /opt/inner-task +{% endif %} + +# warm the uv cache so eval-time `uv run --project ` resolves offline-fast +RUN cd /opt/agent-baseline && uv sync 2>/dev/null || true + +# allow the engine to fetch from the ro-mounted agent repo (different owner) +RUN git config --system --add safe.directory '*' + +WORKDIR /opt diff --git a/vero/src/vero/harbor/build/templates/docker-compose.yaml.j2 b/vero/src/vero/harbor/build/templates/docker-compose.yaml.j2 new file mode 100644 index 0000000..78f026c --- /dev/null +++ b/vero/src/vero/harbor/build/templates/docker-compose.yaml.j2 @@ -0,0 +1,45 @@ +# Merged LAST by Harbor over its build template (which auto-configures `main` +# from environment/Dockerfile). We add the eval-sidecar + volumes and wire main. +services: + main: + # Run as root so the seed step can chown the repo and the verifier (shared + # mode) can read the root:600 admin token. Harbor execs the optimizer as the + # [agent].user ("agent") declared in task.toml. + command: ["/opt/seed.sh"] + environment: + VERO_EVAL_URL: "http://eval-sidecar:8000" + volumes: + - agent_repo:/work/agent + - agent_results:/state/agent-results:ro + - token_state:/state/token:ro + depends_on: + eval-sidecar: + condition: service_healthy + + eval-sidecar: + build: + context: . 
+ dockerfile: sidecar/Dockerfile + command: ["vero", "harbor", "serve", "--config", "/opt/serve.json"] + environment: + VERO_HOME_DIR: "/opt/vero_home" +{% for secret in secrets %} + {{ secret }}: "${{ '{' }}{{ secret }}{{ '}' }}" +{% endfor %} + volumes: + - agent_repo:/work/agent:ro + - agent_results:/state/agent-results + - admin_state:/state/admin + - token_state:/state/token + healthcheck: + test: ["CMD", "python", "-c", "import urllib.request,sys; sys.exit(0 if urllib.request.urlopen('http://localhost:8000/health').status==200 else 1)"] + interval: 5s + timeout: 10s + retries: 30 + start_period: 10s + +volumes: + agent_repo: + agent_results: + admin_state: + token_state: diff --git a/vero/src/vero/harbor/build/templates/instruction.md.j2 b/vero/src/vero/harbor/build/templates/instruction.md.j2 new file mode 100644 index 0000000..1e11430 --- /dev/null +++ b/vero/src/vero/harbor/build/templates/instruction.md.j2 @@ -0,0 +1,28 @@ +# Optimization task + +You are optimizing the code in `/work/agent`. Improve it so it scores as high as +possible on a **hidden test split** — but you never see the test split. You measure +progress on the splits you *are* allowed to evaluate, within a fixed budget. + +## Workflow + +1. Edit the repo at `/work/agent`. Some paths are read-only (the scorer) — leave them. +2. Commit your changes (`git commit`). +3. Measure a commit on an allowed split: + + ``` + vero harbor eval --dataset-id {{ dataset_id }} --split {{ selection_split }} + ``` + + (defaults to your current `HEAD`). Returns an aggregate score and remaining budget. +4. Check budget / which splits are evaluable anytime: `vero harbor status`. +{% if submit_enabled %}5. When done, nominate your best commit: `vero harbor submit`.{% else %} +The best commit you evaluate on `{{ selection_split }}` is selected automatically and +scored on the hidden test split at the end.{% endif %} + +## Rules + +- Budget is finite and metered per split — spend it wisely. 
+- The test split is hidden: you cannot evaluate it, and its labels never reach this + container. Trying to read it will fail. +- The scorer is locked. Only the eval sidecar scores. diff --git a/vero/src/vero/harbor/build/templates/seed.sh.j2 b/vero/src/vero/harbor/build/templates/seed.sh.j2 new file mode 100644 index 0000000..c284211 --- /dev/null +++ b/vero/src/vero/harbor/build/templates/seed.sh.j2 @@ -0,0 +1,21 @@ +#!/bin/sh +# Seed the optimizer's working repo onto the shared volume and apply write-access +# rules, then keep `main` alive. Runs as root at container start. +set -e + +if [ ! -d /work/agent/.git ]; then + cp -a /opt/agent-seed/. /work/agent/ +fi + +# Whole repo is the optimizer's to edit... +chown -R agent:agent /work/agent +git config --system --add safe.directory /work/agent +{% for p in read_only_paths %} +# ...except locked paths (e.g. the scorer): root-owned + unwritable. +if [ -e "/work/agent/{{ p }}" ]; then + chown -R root:root "/work/agent/{{ p }}" + chmod -R a-w "/work/agent/{{ p }}" +fi +{% endfor %} + +exec sleep infinity diff --git a/vero/src/vero/harbor/build/templates/solve.sh.j2 b/vero/src/vero/harbor/build/templates/solve.sh.j2 new file mode 100644 index 0000000..bc97e5e --- /dev/null +++ b/vero/src/vero/harbor/build/templates/solve.sh.j2 @@ -0,0 +1,17 @@ +#!/bin/bash +# Oracle "optimizer" used for the e2e smoke test: make one trivial edit, commit, +# and measure it on the selection split. The auto-best verifier then scores the +# selected commit on the hidden test split. A real optimizer agent replaces this. +set -ex +cd /work/agent +git config user.email optimizer@example.com +git config user.name optimizer + +# A no-op-ish "improvement" so there is a non-baseline commit to select. 
+echo "# optimizer touch" >> README.md 2>/dev/null || echo "# optimizer touch" > NOTES.md +git add -A +git commit -m "optimizer candidate" + +vero harbor eval --dataset-id {{ dataset_id }} --split {{ selection_split }}{% if eval_num_samples %} --num-samples {{ eval_num_samples }}{% endif %} + +vero harbor status diff --git a/vero/src/vero/harbor/build/templates/task.toml.j2 b/vero/src/vero/harbor/build/templates/task.toml.j2 new file mode 100644 index 0000000..c037e22 --- /dev/null +++ b/vero/src/vero/harbor/build/templates/task.toml.j2 @@ -0,0 +1,23 @@ +schema_version = "1.3" + +[task] +name = "{{ name }}" +description = "{{ description }}" + +[agent] +# The optimizer runs as a de-privileged user so it cannot read the admin token +# (root:600) or the admin volume. It edits the target repo + calls `vero harbor eval`. +user = "agent" + +[verifier] +# Shared mode: Harbor runs tests/test.sh in `main` with the whole env (incl. the +# eval-sidecar) still up. The verifier runs as root, reads the admin token, and +# calls the sidecar's `finalize` endpoint to score the selected commit. +environment_mode = "shared" +timeout_sec = {{ timeout }} + +[environment] +# Compose-based environment: environment/docker-compose.yaml adds the eval-sidecar +# service + volumes and wires `main`. Secrets are injected into the sidecar only +# (see the compose file), never declared here (this section's env reaches `main`). +build_timeout_sec = 1800 diff --git a/vero/src/vero/harbor/build/templates/test.sh.j2 b/vero/src/vero/harbor/build/templates/test.sh.j2 new file mode 100644 index 0000000..f65e477 --- /dev/null +++ b/vero/src/vero/harbor/build/templates/test.sh.j2 @@ -0,0 +1,10 @@ +#!/bin/bash +# Verifier (shared mode, root). Reads the admin token (root:600, unreadable by the +# optimizer) and asks the eval sidecar to select + score the commit on the hidden +# test split, writing the reward. 
+set -e +mkdir -p /logs/verifier +vero harbor finalize \ + --token-file /state/token/admin.token \ + --output /logs/verifier/reward.json +cat /logs/verifier/reward.json diff --git a/vero/src/vero/harbor/cli.py b/vero/src/vero/harbor/cli.py new file mode 100644 index 0000000..e68ce4c --- /dev/null +++ b/vero/src/vero/harbor/cli.py @@ -0,0 +1,127 @@ +"""`vero harbor` CLI. + +Thin clients the optimizer and verifier use inside the compiled task: + - agent (in `main`): eval / submit / status -> POST/GET the sidecar over VERO_EVAL_URL + - verifier (in `main`): finalize -> POST /finalize with the admin token, + write /logs/verifier/reward.json +`serve` (sidecar entry) and `build`/`run` (host-side compiler) are added with stage (c). +""" + +from __future__ import annotations + +import json +import os +from pathlib import Path + +import click + + +def _base_url() -> str: + url = os.environ.get("VERO_EVAL_URL") + if not url: + raise click.ClickException("VERO_EVAL_URL is not set (the eval sidecar URL).") + return url.rstrip("/") + + +def _request(method: str, path: str, *, payload: dict | None = None, headers: dict | None = None): + import httpx + + resp = httpx.request( + method, f"{_base_url()}{path}", json=payload, headers=headers or {}, timeout=None + ) + if resp.status_code >= 400: + raise click.ClickException(f"{method} {path} -> {resp.status_code}: {resp.text}") + return resp.json() + + +@click.group() +def harbor() -> None: + """Vero ⇄ Harbor: optimization-as-a-Harbor-task commands.""" + + +@harbor.command("serve") +@click.option("--config", "config_path", required=True, help="Path to the ServeConfig JSON.") +def serve_cmd(config_path): + """Eval-sidecar entrypoint: assemble the engine/sidecar/verifier and serve (uvicorn).""" + from vero.harbor.serve import serve + + serve(config_path) + + +@harbor.command("build") +@click.option("-c", "--config", "config_path", required=True, help="Path to build.yaml.") +@click.option("-o", "--out", required=True, help="Output task 
directory.") +def build_cmd(config_path, out): + """Compile a build.yaml into a runnable Harbor optimization task directory.""" + from vero.harbor.build import BuildConfig, compile_task + + task_dir = compile_task(BuildConfig.from_file(config_path), out) + click.echo(f"Compiled task -> {task_dir}") + + +@harbor.command("run", context_settings={"ignore_unknown_options": True}) +@click.option("-c", "--config", "config_path", required=True, help="Path to build.yaml.") +@click.option("-a", "--agent", required=True, help="Optimizer agent (passed to harbor run).") +@click.option("-m", "--model", default=None, help="Model for the optimizer agent.") +@click.option("-e", "--environment", "provider", default="docker", show_default=True) +@click.argument("extra", nargs=-1, type=click.UNPROCESSED) +def run_cmd(config_path, agent, model, provider, extra): + """Build to a temp dir, then `harbor run` it (build + run convenience).""" + import subprocess + import tempfile + + from vero.harbor.build import BuildConfig, compile_task + + task_dir = compile_task(BuildConfig.from_file(config_path), Path(tempfile.mkdtemp()) / "task") + cmd = ["uvx", "harbor", "run", "-p", str(task_dir), "-a", agent, "-e", provider] + if model: + cmd += ["-m", model] + cmd += list(extra) + click.echo(f"$ {' '.join(cmd)}") + raise SystemExit(subprocess.call(cmd)) + + +@harbor.command("eval") +@click.option("--dataset-id", required=True) +@click.option("--split", required=True) +@click.option("--commit", default=None, help="Defaults to the agent repo HEAD.") +@click.option("--num-samples", type=int, default=None) +@click.option("--sample-ids", default=None, help="Comma-separated sample ids.") +def eval_cmd(dataset_id, split, commit, num_samples, sample_ids): + """Spend one evaluation of your commit on a split (agent).""" + payload: dict = {"dataset_id": dataset_id, "split": split} + if commit: + payload["commit"] = commit + if num_samples is not None: + payload["num_samples"] = num_samples + if sample_ids: 
+ payload["sample_ids"] = [int(x) for x in sample_ids.split(",")] + click.echo(json.dumps(_request("POST", "/eval", payload=payload), indent=2)) + + +@harbor.command("submit") +@click.option("--commit", default=None, help="Defaults to the agent repo HEAD.") +def submit_cmd(commit): + """Nominate a commit and end the optimization run (agent; if enabled).""" + click.echo(json.dumps(_request("POST", "/submit", payload={"commit": commit}), indent=2)) + + +@harbor.command("status") +def status_cmd(): + """Show remaining budget, evaluable splits, and whether submit is enabled (agent).""" + click.echo(json.dumps(_request("GET", "/status"), indent=2)) + + +@harbor.command("finalize") +@click.option("--token-file", required=True, help="Path to the admin token (root:600).") +@click.option("--output", default="/logs/verifier/reward.json", show_default=True) +def finalize_cmd(token_file, output): + """Verifier: select the best/submitted commit, score on the test split, write reward.json (admin).""" + from vero.harbor.auth import read_admin_token + + token = read_admin_token(token_file) + reward = _request("POST", "/finalize", headers={"Authorization": f"Bearer {token}"}) + out = Path(output) + out.parent.mkdir(parents=True, exist_ok=True) + out.write_text(json.dumps(reward)) + click.echo(json.dumps(reward, indent=2)) diff --git a/vero/src/vero/harbor/config.py b/vero/src/vero/harbor/config.py new file mode 100644 index 0000000..da5f56e --- /dev/null +++ b/vero/src/vero/harbor/config.py @@ -0,0 +1,35 @@ +"""HarborConfig — the Mode-B configuration. + +User-facing config that turns "evaluate my agent on a set of Harbor tasks" into a +`harbor run` invocation. A typed projection of the user-controllable `harbor run` +flags; the per-eval-derived flags (task selection, jobs dir, source/agent resolution) +are filled in by the runner, not here. 
+""" + +from __future__ import annotations + +from dataclasses import dataclass, field +from pathlib import Path + + +@dataclass +class HarborConfig: + task_source: str # registry ref "org/name[@ver]" OR a local path to a task dir/dataset + agent_import_path: str # module path to the candidate agent, e.g. "pkg.mod:Class" + model: str | None = None + environment: str = "modal" # cloud provider (docker allowed for local testing) + n_attempts: int = 1 + max_retries: int = 2 + reward_key: str | None = None # primary reward; default pass -> reward -> mean + extra_args: list[str] = field(default_factory=list) # passthrough harbor run flags + + @property + def is_registry(self) -> bool: + """Local if the source resolves to an existing path; otherwise a registry ref.""" + return not Path(self.task_source).expanduser().exists() + + def source_args(self) -> list[str]: + """`harbor run` source selector: `-d ` (registry) or `-p ` (local).""" + if self.is_registry: + return ["-d", self.task_source] + return ["-p", str(Path(self.task_source).expanduser())] diff --git a/vero/src/vero/harbor/dataset.py b/vero/src/vero/harbor/dataset.py new file mode 100644 index 0000000..c9ececa --- /dev/null +++ b/vero/src/vero/harbor/dataset.py @@ -0,0 +1,80 @@ +"""Build the vero dataset (task-name references + split partition) for Mode B. + +A Mode-B vero dataset has no labels — each "sample" is a Harbor task name. A local +task's name is its subdirectory name (the dir containing ``task.toml``), matching what +``harbor run -i/--include-task-name`` filters on; registry task names come from the +registry's task configs. + +The split partition is a ``dict[str, list[str]]`` (e.g. ``{"train": [...], "test": [...]}``) +supplied by the benchmark author; this module compiles + validates it into a DatasetDict. 
+""" + +from __future__ import annotations + +from pathlib import Path +from typing import TYPE_CHECKING + +if TYPE_CHECKING: + from datasets import DatasetDict + + +def build_harbor_dataset(partition: dict[str, list[str]]) -> DatasetDict: + """Compile a ``{split: [task_names]}`` partition into a vero DatasetDict. + + Each split is a single-column (`task_name`) Dataset — the label-free sample + references Mode B evaluates. + """ + from datasets import Dataset, DatasetDict + + if not partition: + raise ValueError("Harbor dataset partition is empty.") + return DatasetDict( + {split: Dataset.from_dict({"task_name": list(names)}) for split, names in partition.items()} + ) + + +def enumerate_local_task_names(task_source: str | Path) -> list[str]: + """Task names available in a local Harbor task source. + + If the path is itself a task dir (contains ``task.toml``), returns ``[dir_name]``; + otherwise returns the names of immediate subdirectories that contain ``task.toml``. + """ + path = Path(task_source).expanduser() + if (path / "task.toml").exists(): + return [path.name] + if not path.is_dir(): + raise ValueError(f"Local task source is not a directory: {path}") + return sorted( + d.name for d in path.iterdir() if d.is_dir() and (d / "task.toml").exists() + ) + + +async def enumerate_registry_task_names( + ref: str, *, registry_url: str | None = None +) -> list[str]: + """Task names in a registry dataset (``org/name[@version]``). + + Lazy-imports the ``harbor`` SDK (the ``harbor`` extra) — registry resolution is a + build-time concern, not a sidecar-runtime one. Integration-verified. 
+ """ + from harbor.models.job.config import RegistryDatasetConfig + from harbor.models.registry import RemoteRegistryInfo + + name, _, version = ref.partition("@") + config = RegistryDatasetConfig( + registry=RemoteRegistryInfo(url=registry_url) if registry_url else None, + name=name, + version=version or None, + ) + return sorted(tc.path.name for tc in await config.get_task_configs()) + + +def validate_partition(partition: dict[str, list[str]], available: list[str]) -> None: + """Raise if the partition references task names not in ``available``.""" + avail = set(available) + referenced = {name for names in partition.values() for name in names} + unknown = referenced - avail + if unknown: + raise ValueError( + f"Partition references task names not found in the source: {sorted(unknown)}" + ) diff --git a/vero/src/vero/harbor/protocol.py b/vero/src/vero/harbor/protocol.py new file mode 100644 index 0000000..eeb6a4e --- /dev/null +++ b/vero/src/vero/harbor/protocol.py @@ -0,0 +1,106 @@ +"""Wire types for the eval sidecar's HTTP frontend, and the redaction that +projects a full Experiment down to what the agent may see. + +`EvalRequest` (the request) lives in `vero.evaluation.engine` — it is shared with +the in-process tool. The *response* types here are sidecar-specific: they are +aggregate-safe by construction (never per-sample), because per-sample detail is +delivered as files on the agent-readable volume, gated by split tier. +""" + +from __future__ import annotations + +from dataclasses import asdict, dataclass, field + +from vero.core.budget import SplitBudget +from vero.core.dataset.base import SplitAccess, SplitAccessLevel +from vero.core.db.database import Experiment + + +@dataclass +class EvalSummary: + """Aggregate-safe response to an agent evaluate call. + + Carries no per-sample data. 
Per-sample detail (for visible splits) and + summary stats (for partial splits) are written to the agent-readable volume + at `result_path`; nothing is written there for no_access splits. + """ + + commit: str + split: str + dataset_id: str + n_samples: int + mean_score: float | None + result_path: str | None # where on the agent volume to read detail (None if nothing written) + budget_remaining: SplitBudget | None = None + + def to_dict(self) -> dict: + d = asdict(self) + if self.budget_remaining is not None: + d["budget_remaining"] = asdict(self.budget_remaining) + return d + + +@dataclass +class StatusSummary: + """Response to a status call. `submit_enabled` (not the verifier-internal + selection strategy) is what the agent needs to know.""" + + submit_enabled: bool + # per (split, dataset_id): tier + whether the agent may evaluate it + remaining budget + splits: list[dict] = field(default_factory=list) + + def to_dict(self) -> dict: + return asdict(self) + + +def tier_for_split(split: str, split_accesses: list[SplitAccess]) -> SplitAccessLevel: + """Resolve a split's visibility tier (default: viewable when unlisted).""" + for sa in split_accesses: + if sa.split == split: + return sa.access + return SplitAccessLevel.viewable + + +def summarize_experiment( + experiment: Experiment, + *, + result_path: str | None, + budget_remaining: SplitBudget | None = None, +) -> EvalSummary: + """Project a full Experiment to an aggregate-safe EvalSummary.""" + return EvalSummary( + commit=experiment.run.candidate.commit, + split=experiment.run.dataset_subset.split, + dataset_id=experiment.run.dataset_subset.dataset_id, + n_samples=len(experiment.result.sample_results), + mean_score=experiment.result.score(), + result_path=result_path, + budget_remaining=budget_remaining, + ) + + +def build_status( + *, + submit_enabled: bool, + budget: dict[tuple[str, str], SplitBudget], + split_accesses: list[SplitAccess], +) -> StatusSummary: + """Build the agent-facing status from the 
budget ledger + split tiers. + + Only budgeted (split, dataset_id) pairs are listed — those are exactly what + the agent may evaluate. no_access splits are not in the agent ledger. + """ + splits = [] + for (split, dataset_id), b in budget.items(): + tier = tier_for_split(split, split_accesses) + splits.append( + { + "split": split, + "dataset_id": dataset_id, + "tier": str(tier), + "agent_evaluable": tier != SplitAccessLevel.no_access, + "remaining_sample_budget": b.remaining_sample_budget, + "remaining_run_budget": b.remaining_run_budget, + } + ) + return StatusSummary(submit_enabled=submit_enabled, splits=splits) diff --git a/vero/src/vero/harbor/runner.py b/vero/src/vero/harbor/runner.py new file mode 100644 index 0000000..ca11a5d --- /dev/null +++ b/vero/src/vero/harbor/runner.py @@ -0,0 +1,220 @@ +"""HarborRunner — the Mode-B evaluation strategy. + +Implements ``EvalStrategy``: for a checked-out candidate, runs a nested ``harbor run`` +(in the candidate's own uv env) over the Harbor tasks selected by the split/sample_ids, +then collates the jobs dir into vero ``SampleResult``s. One Harbor task = one sample. + +Shells out to the ``harbor`` CLI (no harbor import here) and reads trial ``result.json`` +as plain dicts, so ``vero`` itself needs no ``harbor`` dependency at runtime. 
+""" + +from __future__ import annotations + +import json +import logging +from pathlib import Path +from typing import TYPE_CHECKING + +from vero.core.db.dataset import DatasetSample +from vero.core.db.result import SampleResult +from vero.core.sessions import ( + get_vero_home_dir, + load_sample_result, + save_sample_result, +) +from vero.harbor.config import HarborConfig +from vero.utils import run_subprocess_with_tee + +if TYPE_CHECKING: + from vero.core.evaluation import EvaluationParameters + from vero.workspace import Workspace + +logger = logging.getLogger(__name__) + + +class HarborRunner: + """Mode-B EvalStrategy: nested `harbor run` + collate -> SampleResults.""" + + def __init__(self, config: HarborConfig): + self.config = config + + async def produce_sample_results( + self, + *, + workspace: Workspace, + params: EvaluationParameters, + result_dir: Path, + ) -> None: + pairs = self._task_names_for(params) # [(sample_id, task_name), ...] + if not pairs: + return + jobs_dir = Path(result_dir) / "jobs" + + # Resume: only run tasks without an already-persisted SampleResult. 
+ pending = [(sid, t) for sid, t in pairs if self._existing(params, sid) is None] + if pending: + await self._run_harbor( + str(workspace.project_path), params, [t for _, t in pending], jobs_dir + ) + self._collate(jobs_dir, pairs, params) + + # ------------------------------------------------------------------ + # Task selection (host-side; just task names) + # ------------------------------------------------------------------ + + def _task_names_for(self, params: EvaluationParameters) -> list[tuple[int, str]]: + from vero.core.dataset.store import load_dataset + + vero_home = get_vero_home_dir() + dataset = load_dataset( + vero_home / "sessions", + vero_home / "datasets", + params.session_id, + params.run.dataset_subset.dataset_id, + ) + split = dataset[params.run.dataset_subset.split] + ids = params.run.dataset_subset.sample_ids + if ids is None: + ids = list(range(len(split))) + return [(i, split[i]["task_name"]) for i in ids] + + # ------------------------------------------------------------------ + # Execute + # ------------------------------------------------------------------ + + def _build_command( + self, + project_path: str, + params: EvaluationParameters, + task_names: list[str], + jobs_dir: Path, + ) -> list[str]: + c = self.config + cmd = [ + "uv", "run", "--project", project_path, + "harbor", "run", + *c.source_args(), + "--agent-import-path", c.agent_import_path, + "-e", c.environment, + "-n", str(params.max_concurrency), + ] + if c.model: + cmd += ["-m", c.model] + for task_name in task_names: + cmd += ["-i", task_name] + cmd += ["--jobs-dir", str(jobs_dir), *c.extra_args] + return cmd + + async def _run_harbor( + self, + project_path: str, + params: EvaluationParameters, + task_names: list[str], + jobs_dir: Path, + ) -> None: + cmd = self._build_command(project_path, params, task_names, jobs_dir) + logger.info(f"Mode B: {' '.join(cmd)}") + result = await run_subprocess_with_tee( + cmd, timeout=params.timeout, cwd=project_path + ) + # Non-zero is 
not fatal: partial trials may still exist; collation fills gaps. + if result.returncode != 0: + logger.warning( + f"`harbor run` exited {result.returncode}: " + f"{(result.stderr or '')[:500]}" + ) + + # ------------------------------------------------------------------ + # Collate + # ------------------------------------------------------------------ + + def _collate( + self, + jobs_dir: Path, + pairs: list[tuple[int, str]], + params: EvaluationParameters, + ) -> None: + trials = self._load_trials(jobs_dir) # {task_name: result_dict} + for sample_id, task_name in pairs: + if self._existing(params, sample_id) is not None: + continue # already collated (resume) + sample_result = self._sample_result( + trials.get(task_name), sample_id, task_name, params + ) + save_sample_result( + get_vero_home_dir() / "sessions", + params.session_id, + params.result_id, + sample_id=sample_id, + result=sample_result, + ) + + def _load_trials(self, jobs_dir: Path) -> dict[str, dict]: + trials: dict[str, dict] = {} + if not jobs_dir.exists(): + return trials + # Trial result.json files live at ///result.json; the + # job-level //result.json carries no task_name, so recurse and + # key on task_name (skipping the job summary). 
+ for result_json in jobs_dir.rglob("result.json"): + try: + data = json.loads(result_json.read_text()) + except (json.JSONDecodeError, OSError): + continue + task_name = data.get("task_name") + if task_name: + trials[task_name] = data + return trials + + def _sample_result( + self, + trial: dict | None, + sample_id: int, + task_name: str, + params: EvaluationParameters, + ) -> SampleResult: + common = { + "dataset_sample": DatasetSample( + sample_id=sample_id, + split=params.run.dataset_subset.split, + dataset_id=params.run.dataset_subset.dataset_id, + ), + "commit": params.run.candidate.commit, + "result_id": params.result_id, + } + if trial is None: + return SampleResult( + error=f"No Harbor trial result for task '{task_name}'.", **common + ) + rewards = (trial.get("verifier_result") or {}).get("rewards") or {} + if not rewards: + return SampleResult( + error=f"No verifier rewards for task '{task_name}'.", + output={"task_name": task_name, "trial_name": trial.get("trial_name")}, + **common, + ) + return SampleResult( + score=self._extract_reward(rewards), + metrics={k: float(v) for k, v in rewards.items()}, + output={ + "task_name": task_name, + "trial_name": trial.get("trial_name"), + "rewards": rewards, + }, + **common, + ) + + def _extract_reward(self, rewards: dict) -> float: + for key in (self.config.reward_key, "pass", "reward"): + if key and key in rewards: + return float(rewards[key]) + values = [float(v) for v in rewards.values()] + return sum(values) / len(values) if values else 0.0 + + def _existing(self, params: EvaluationParameters, sample_id: int) -> SampleResult | None: + return load_sample_result( + get_vero_home_dir() / "sessions", + params.session_id, + params.result_id, + sample_id, + ) diff --git a/vero/src/vero/harbor/serve.py b/vero/src/vero/harbor/serve.py new file mode 100644 index 0000000..ec2e98f --- /dev/null +++ b/vero/src/vero/harbor/serve.py @@ -0,0 +1,170 @@ +"""`vero harbor serve` — the eval-sidecar entrypoint. 
+ +Assembles the EvaluationEngine + EvaluationSidecar + Verifier from a ServeConfig +(written by the compiler, baked into the sidecar image), generates the per-trial admin +token, and serves the FastAPI app under uvicorn. ServeConfig is the compiler↔serve +contract. +""" + +from __future__ import annotations + +import logging +from pathlib import Path + +from pydantic import BaseModel, Field + +from vero.core.budget import BudgetLedger, SplitBudget +from vero.core.dataset.base import SplitAccess, SplitAccessLevel +from vero.core.db.database import ExperimentDatabase +from vero.core.evaluation import BaseEvaluationParameters +from vero.core.sessions import get_vero_home_dir +from vero.evaluation.engine import EvaluationEngine +from vero.evaluation.evaluator import Evaluator +from vero.harbor.app import create_app +from vero.harbor.auth import generate_token, write_admin_token +from vero.harbor.server import EvaluationSidecar +from vero.harbor.verifier import VerificationTarget, Verifier +from vero.workspace.git import GitWorkspace + +logger = logging.getLogger(__name__) + + +class _SplitAccessCfg(BaseModel): + split: str + access: str # "viewable" | "non_viewable" | "no_access" + + +class _TargetCfg(BaseModel): + task: str | None = None + dataset_id: str + split: str + reward_key: str = "reward" + sample_ids: list[int] | None = None + + +class ServeConfig(BaseModel): + """Everything the sidecar needs to assemble itself. 
Baked by the compiler.""" + + repo_path: str # sidecar's own repo (baseline target) = the engine workspace + agent_repo_path: str # mounted agent workspace (commit-transfer source) + session_id: str + dataset_id: str # already registered in the sidecar's VERO_HOME + split_accesses: list[_SplitAccessCfg] + budgets: list[dict] # SplitBudget kwargs + + # Mode A + task: str | None = None + task_project: str | None = None + task_module: str | None = None + # Mode B + harbor: dict | None = None # HarborConfig kwargs + + # selection / reward + reward_mode: str = "auto_best" + selection_split: str = "validation" + targets: list[_TargetCfg] = Field(default_factory=list) + base_commit: str | None = None + submit_enabled: bool = False + + # volumes / token + agent_volume: str + admin_volume: str + admin_token_path: str + + # eval params + timeout: int = 600 + sample_timeout: int = 180 + max_concurrency: int = 20 + use_copy: bool = True # isolate each eval in a temp copy (clean tree, concurrency-safe) + + host: str = "0.0.0.0" + port: int = 8000 + + @classmethod + def from_file(cls, path: Path | str) -> ServeConfig: + return cls.model_validate_json(Path(path).read_text()) + + +async def build_components(config: ServeConfig) -> tuple[EvaluationSidecar, Verifier, str]: + """Assemble the sidecar + verifier (sharing one engine) and the admin token.""" + vero_home = get_vero_home_dir() + workspace = await GitWorkspace.create(config.repo_path) + + budget = BudgetLedger( + [SplitBudget(**b) for b in config.budgets], + persist_path=Path(config.admin_volume) / "ledger.json", + ) + + eval_strategy = None + if config.harbor is not None: + from vero.harbor.runner import HarborRunner + from vero.harbor.config import HarborConfig + + eval_strategy = HarborRunner(HarborConfig(**config.harbor)) + + evaluator = Evaluator( + workspace, + config.session_id, + vero_home=vero_home, + use_copy=config.use_copy, + task_project=Path(config.task_project) if config.task_project else None, + 
task_module=config.task_module, + eval_strategy=eval_strategy, + ) + + db = ExperimentDatabase(id=config.session_id) # shared by engine (writes) + verifier (reads) + engine = EvaluationEngine( + evaluator=evaluator, + budget=budget, + default_task=config.task, + db=db, + run_constraints=BaseEvaluationParameters( + timeout=config.timeout, + sample_timeout=config.sample_timeout, + max_concurrency=config.max_concurrency, + ), + session_id=config.session_id, + vero_home=vero_home, + ) + + split_accesses = [ + SplitAccess(split=s.split, access=SplitAccessLevel(s.access)) + for s in config.split_accesses + ] + sidecar = EvaluationSidecar( + engine=engine, + split_accesses=split_accesses, + agent_repo_path=Path(config.agent_repo_path), + agent_volume=Path(config.agent_volume), + admin_volume=Path(config.admin_volume), + submit_enabled=config.submit_enabled, + ) + verifier = Verifier( + engine=engine, + admin_volume=Path(config.admin_volume), + reward_mode=config.reward_mode, # type: ignore[arg-type] + targets=[VerificationTarget(**t.model_dump()) for t in config.targets], + selection_split=config.selection_split, + base_commit=config.base_commit, + ) + + token = generate_token() + write_admin_token(config.admin_token_path, token) + return sidecar, verifier, token + + +async def build_app(config: ServeConfig): + sidecar, verifier, token = await build_components(config) + return create_app(sidecar=sidecar, verifier=verifier, admin_token=token) + + +def serve(config_path: Path | str) -> None: + """Sidecar entrypoint: build the app and run it under uvicorn.""" + import asyncio + + import uvicorn + + config = ServeConfig.from_file(config_path) + app = asyncio.run(build_app(config)) + logger.info(f"Serving eval sidecar on {config.host}:{config.port}") + uvicorn.run(app, host=config.host, port=config.port) diff --git a/vero/src/vero/harbor/server.py b/vero/src/vero/harbor/server.py new file mode 100644 index 0000000..a4c3f86 --- /dev/null +++ b/vero/src/vero/harbor/server.py @@ 
-0,0 +1,186 @@ +"""EvaluationSidecar: the privileged, transport-agnostic frontend over the +EvaluationEngine, plus the trust-boundary mechanics that only exist in the Harbor +sidecar — commit transfer from the mounted agent repo and tier-gated +write-routing of results across the two volumes. + +The HTTP binding (`serve()`) is a thin shell added when the `vero harbor serve` +CLI lands; these handlers are framework-agnostic and unit-testable on their own. +""" + +from __future__ import annotations + +import json +import logging +from dataclasses import replace +from pathlib import Path + +from vero.core.dataset.base import SplitAccess, SplitAccessLevel +from vero.core.db.database import Experiment +from vero.evaluation.engine import EvalRequest, EvaluationEngine +from vero.exceptions import InvalidSplitError +from vero.harbor.protocol import ( + EvalSummary, + StatusSummary, + build_status, + summarize_experiment, + tier_for_split, +) + +logger = logging.getLogger(__name__) + + +class CommitTransferError(RuntimeError): + """Raised when a commit cannot be fetched from the agent's mounted repo.""" + + +class SubmitDisabledError(RuntimeError): + """Raised when submit() is called but the task does not use submit selection.""" + + +class EvaluationSidecar: + """Agent-facing handlers over the EvaluationEngine. + + Wraps the engine with: commit transfer (mounted agent repo -> sidecar repo), + result write-routing by split tier, and aggregate-safe responses. The engine + meters agent calls (admin calls bypass). 
+ """ + + def __init__( + self, + *, + engine: EvaluationEngine, + split_accesses: list[SplitAccess], + agent_repo_path: Path, + agent_volume: Path, + admin_volume: Path, + submit_enabled: bool = False, + ): + self.engine = engine + self.split_accesses = split_accesses + self.agent_repo_path = Path(agent_repo_path) + self.agent_volume = Path(agent_volume) + self.admin_volume = Path(admin_volume) + self.submit_enabled = submit_enabled + + # ------------------------------------------------------------------ + # Handlers (the HTTP layer resolves `admin` from auth and calls these) + # ------------------------------------------------------------------ + + async def evaluate(self, req: EvalRequest, *, admin: bool = False) -> EvalSummary: + sha = await self._transfer_commit(req.commit) + exp = await self.engine.evaluate(replace(req, commit=sha), admin=admin) + result_path = self._route_results(exp, admin=admin) + budget_remaining = None + if not admin: + try: + budget_remaining = self.engine.budget.get(req.dataset_id, req.split) + except InvalidSplitError: + pass + return summarize_experiment( + exp, result_path=result_path, budget_remaining=budget_remaining + ) + + async def submit(self, commit: str | None = None) -> dict: + """Record the agent's nominated commit; terminal. No score returned.""" + if not self.submit_enabled: + raise SubmitDisabledError( + "This task does not use submit-based selection; submit is disabled." 
+ ) + sha = await self._transfer_commit(commit) + self.admin_volume.mkdir(parents=True, exist_ok=True) + (self.admin_volume / "submission.json").write_text( + json.dumps({"commit": sha}, indent=2) + ) + return {"submitted_commit": sha} + + def status(self) -> StatusSummary: + return build_status( + submit_enabled=self.submit_enabled, + budget=self.engine.budget.status(), + split_accesses=self.split_accesses, + ) + + # ------------------------------------------------------------------ + # Trust-boundary mechanics + # ------------------------------------------------------------------ + + async def _transfer_commit(self, ref: str | None) -> str: + """Fetch ``ref`` (default agent HEAD) from the mounted agent repo into the + sidecar's own repo and return its resolved sha. + + The agent repo is untrusted: hooks are disabled and ``file://`` forces an + object copy (no hardlink/alternates) so the fetched commit is fully owned + by the sidecar repo and tamper-evident. + """ + workspace = self.engine.evaluator.workspace + root = workspace.root + target = ref or "HEAD" + fetch = await workspace.sandbox.run( + [ + "git", + "-c", + "core.hooksPath=/dev/null", + "-c", + "protocol.file.allow=always", + "-C", + root, + "fetch", + "--no-tags", + "--no-recurse-submodules", + f"file://{self.agent_repo_path}", + target, + ], + timeout=120, + ) + if fetch.returncode != 0: + raise CommitTransferError( + f"git fetch of {target!r} from agent repo failed: {fetch.stderr}" + ) + rev = await workspace.sandbox.run( + ["git", "-C", root, "rev-parse", "FETCH_HEAD"], timeout=30 + ) + if rev.returncode != 0: + raise CommitTransferError(f"rev-parse FETCH_HEAD failed: {rev.stderr}") + return rev.stdout.strip() + + def _route_results(self, experiment: Experiment, *, admin: bool) -> str | None: + """Write the agent-visible projection of an experiment by split tier. + + Full per-sample results always live admin-side (the session store). 
Here we + write only what the agent may see: + - viewable: aggregate summary + full per-sample results + - non_viewable: aggregate summary only (no per-sample / no labels) + - no_access: nothing + Admin/verifier evals never write to the agent volume. + Returns the agent-volume path written, or None. + """ + if admin: + return None + split = experiment.run.dataset_subset.split + tier = tier_for_split(split, self.split_accesses) + if tier == SplitAccessLevel.no_access: + return None + + commit = experiment.run.candidate.commit + dest = self.agent_volume / "results" / f"{split}__{commit[:12]}" + dest.mkdir(parents=True, exist_ok=True) + + # Aggregate summary is label-safe for both viewable and non_viewable tiers. + (dest / "summary.json").write_text( + json.dumps( + { + "split": split, + "commit": commit, + "n_samples": len(experiment.result.sample_results), + "mean_score": experiment.result.score(), + "status": str(experiment.result.status), + }, + indent=2, + ) + ) + if tier == SplitAccessLevel.viewable: + for sample_id, sample_result in experiment.result.sample_results.items(): + (dest / f"{sample_id}.json").write_text( + sample_result.model_dump_json(indent=2) + ) + return str(dest) diff --git a/vero/src/vero/harbor/verifier.py b/vero/src/vero/harbor/verifier.py new file mode 100644 index 0000000..c641d8c --- /dev/null +++ b/vero/src/vero/harbor/verifier.py @@ -0,0 +1,114 @@ +"""Verifier: admin-side commit selection + hidden-split scoring -> reward. + +Runs at trial end. In the shared-verifier deployment the eval sidecar is still +up, so the verifier (root, in the `main` container) reaches this logic through +the sidecar's token-gated ``finalize`` endpoint, sharing the engine's state +(repo, dataset, scoring, ledger, submission record).
It selects the candidate +commit (submit: the agent's nominated commit | auto_best: the best commit on the +selection split, excluding the baseline) and scores it on a configured battery +of targets, emitting a multi-key reward dict that the wiring writes to Harbor's +reward.json. +""" + +from __future__ import annotations + +import json +import logging +from dataclasses import dataclass +from pathlib import Path +from typing import Literal + +from vero.core.constants import default_minimum_score +from vero.evaluation.engine import EvaluationEngine + +logger = logging.getLogger(__name__) + + +class NoCandidateError(RuntimeError): + """Raised when no commit can be selected (no submission / no experiments).""" + + +@dataclass +class VerificationTarget: + """One scoring target -> one named reward in reward.json.""" + + task: str | None # None in Mode B (the nested harbor strategy ignores the vero task) + dataset_id: str + split: str + reward_key: str + sample_ids: list[int] | None = None # None = full split + + +class Verifier: + def __init__( + self, + *, + engine: EvaluationEngine, + admin_volume: Path, + reward_mode: Literal["submit", "auto_best"], + targets: list[VerificationTarget], + selection_split: str = "validation", + base_commit: str | None = None, + ): + self.engine = engine + self.admin_volume = Path(admin_volume) + self.reward_mode = reward_mode + self.targets = targets + self.selection_split = selection_split + self.base_commit = base_commit + + async def finalize(self) -> dict[str, float]: + """Select the commit and score it on every target -> {reward_key: score}.""" + sha = self._select_commit() + logger.info(f"Verifier selected commit {sha} (mode={self.reward_mode})") + rewards: dict[str, float] = {} + for target in self.targets: + exp = await self.engine.evaluate_admin( + task=target.task, + dataset_id=target.dataset_id, + split=target.split, + commit=sha, + sample_ids=target.sample_ids, + ) + score = exp.result.score() + rewards[target.reward_key] 
= ( + float(score) if score is not None else default_minimum_score + ) + return rewards + + def _select_commit(self) -> str: + if self.reward_mode == "submit": + return self._submitted_commit() + return self._best_from_db() + + def _submitted_commit(self) -> str: + path = self.admin_volume / "submission.json" + if not path.exists(): + raise NoCandidateError( + "submit mode but no submission.json — the agent never submitted a commit." + ) + commit = json.loads(path.read_text()).get("commit") + if not commit: + raise NoCandidateError("submission.json has no commit.") + return commit + + def _best_from_db(self) -> str: + """Best candidate by recorded score on the selection split (excludes baseline).""" + if self.engine.db is None: + raise NoCandidateError("auto_best mode but no experiment database.") + df = self.engine.db.get_experiments_df(fill_score=default_minimum_score) + if df.empty or "dataset_subset_split" not in df.columns: + raise NoCandidateError("auto_best mode but no experiments recorded.") + + split_df = df[df["dataset_subset_split"] == self.selection_split] + if self.base_commit is not None: + split_df = split_df[split_df["candidate_commit"] != self.base_commit] + if len(split_df) == 0: + raise NoCandidateError( + f"auto_best mode but no candidate experiments on split " + f"'{self.selection_split}'." 
+ ) + best = split_df.sort_values( + by=["mean_score", "candidate_created_at"], ascending=[False, False] + ).iloc[0] + return best["candidate_commit"] diff --git a/vero/src/vero/policy.py b/vero/src/vero/policy.py index 5e630b9..247dc1a 100644 --- a/vero/src/vero/policy.py +++ b/vero/src/vero/policy.py @@ -1,6 +1,5 @@ from __future__ import annotations -import asyncio import json import logging from dataclasses import dataclass, field @@ -17,7 +16,7 @@ ) from vero.core.db.database import Experiment, ExperimentDatabase from vero.core.evaluation import BaseEvaluationParameters -from vero.evaluator import Evaluator +from vero.evaluation.evaluator import Evaluator from vero.filesystem import AccessRule, AccessType from vero.logging import SessionLogger, log_experiments_to_wandb from vero.sandbox import Sandbox @@ -32,6 +31,8 @@ from datasets import DatasetDict from jinja2 import Template + from vero.harbor.config import HarborConfig + DatasetT = Path | str | DatasetDict logger = logging.getLogger(__name__) @@ -140,6 +141,11 @@ class Policy: # --- Sandbox --- sandbox: Sandbox | None = None + # --- Harbor (Mode B) --- + # When set, evaluation runs a nested `harbor run` (the agent-under-test on the + # configured Harbor tasks) instead of vero-native inference/scoring. + harbor: HarborConfig | None = None + # --- Storage --- vero_home: Path | str | None = None @@ -221,7 +227,7 @@ async def init(self) -> None: # Git workspace — create via sandbox.run() git commands project_path = Path(self.project_path) if self.isolate: - from vero.evaluator import isolate_project + from vero.evaluation.evaluator import isolate_project project_path = isolate_project( project_path, self.session_id, self.ref, sessions_dir=self.sessions_dir @@ -337,6 +343,13 @@ async def init(self) -> None: self._validate_budget_splits() self.session.budget = self.budget + # Mode B: inject a HarborRunner strategy when a HarborConfig is set. 
+ eval_strategy = None + if self.harbor is not None: + from vero.harbor.runner import HarborRunner + + eval_strategy = HarborRunner(self.harbor) + # Evaluator — with explicit subprocess env self.session.evaluator = Evaluator( self.session.workspace, @@ -345,6 +358,7 @@ async def init(self) -> None: subprocess_env_vars=self.subprocess_env_vars, task_project=Path(self.task_project) if self.task_project else None, task_module=self.task_module, + eval_strategy=eval_strategy, ) # Register artifact callbacks on evaluator so they fire for all eval paths diff --git a/vero/src/vero/session.py b/vero/src/vero/session.py index 3a9b4de..d86bb7d 100644 --- a/vero/src/vero/session.py +++ b/vero/src/vero/session.py @@ -8,7 +8,7 @@ from vero.core.dataset import SplitAccess from vero.core.db import ExperimentDatabase from vero.core.evaluation import BaseEvaluationParameters -from vero.evaluator import Evaluator +from vero.evaluation.evaluator import Evaluator from vero.tools.experiment_runner import SplitBudget # noqa: E402 — direct import avoids tools/__init__.py from vero.workspace import Workspace diff --git a/vero/src/vero/tools/experiment_runner.py b/vero/src/vero/tools/experiment_runner.py index c393146..a3820da 100644 --- a/vero/src/vero/tools/experiment_runner.py +++ b/vero/src/vero/tools/experiment_runner.py @@ -2,90 +2,30 @@ import logging from dataclasses import dataclass, field +from pathlib import Path from typing import Callable, NoReturn +from vero.core.budget import BudgetLedger, SplitBudget from vero.core.db.database import Experiment, ExperimentDatabase from vero.core.evaluation import BaseEvaluationParameters -from vero.evaluator import Evaluator +from vero.evaluation.evaluator import Evaluator from vero.exceptions import ( ExperimentBudgetExceeded, ExperimentRunFailedError, - InvalidSplitError, ) +from vero.evaluation.engine import EvalRequest, EvaluationEngine from vero.tools.utils import is_tool logger = logging.getLogger(__name__) +# SplitBudget moved to 
vero.core.budget; re-exported here for the public import path. +__all__ = ["ExperimentRunnerTool", "SplitBudget"] + def _default_on_fatal(msg: str) -> NoReturn: raise RuntimeError(msg) -@dataclass -class SplitBudget: - """A stateful object that tracks the remaining budget for running experiments.""" - - split: str - dataset_id: str = "" - total_sample_budget: int | None = None - remaining_sample_budget: int | None = field(init=False) - total_run_budget: int | None = None - remaining_run_budget: int | None = field(init=False) - max_samples_per_run: int | None = None - - def __repr__(self) -> str: - repr_items = [ - ("split", self.split), - ("dataset_id", self.dataset_id), - ("total_sample_budget", self.total_sample_budget), - ("total_run_budget", self.total_run_budget), - ] - repr_items = [item for item in repr_items if item[1] is not None] - return ( - f"SplitBudget({', '.join([f'{item[0]}={item[1]}' for item in repr_items])})" - ) - - def __post_init__(self): - assert ( - self.total_sample_budget is not None or self.total_run_budget is not None - ), "Either total sample budget or total run budget must be provided." 
- self.remaining_sample_budget = self.total_sample_budget - self.remaining_run_budget = self.total_run_budget - - assert ( - isinstance(self.total_sample_budget, int) - or self.total_sample_budget is None - ) - assert isinstance(self.total_run_budget, int) or self.total_run_budget is None - assert ( - isinstance(self.max_samples_per_run, int) - or self.max_samples_per_run is None - ) - - def has_run_budget(self) -> bool: - return self.remaining_run_budget is None or self.remaining_run_budget > 0 - - def decrement_run_budget(self) -> None: - if self.remaining_run_budget is not None: - self.remaining_run_budget -= 1 - - def has_sample_budget(self, num_samples: int) -> bool: - return ( - self.remaining_sample_budget is None - or self.remaining_sample_budget >= num_samples - ) - - def decrement_sample_budget(self, num_samples: int) -> None: - if self.remaining_sample_budget is not None: - self.remaining_sample_budget -= num_samples - - def exceeds_per_run_budget(self, num_samples: int) -> bool: - return ( - self.max_samples_per_run is not None - and num_samples > self.max_samples_per_run - ) - - @dataclass class ExperimentRunnerTool: """Run target agents on tasks and get performance metrics.""" @@ -101,15 +41,25 @@ class ExperimentRunnerTool: ) _task: str | None = None db: ExperimentDatabase | None = None - _budget_map: dict[tuple[str, str], SplitBudget] = field( - default_factory=dict, repr=False - ) + _vero_home: Path | None = None + _session_id: str | None = None + # The shared evaluation core. This tool is a thin frontend over it (formats + # results for the LLM, owns on_fatal); the Harbor sidecar is the other frontend. 
+ engine: EvaluationEngine | None = field(default=None, repr=False) def __post_init__(self): - if self.split_budgets: - self._budget_map = { - (sb.split, sb.dataset_id): sb for sb in self.split_budgets - } + self._build_engine() + + def _build_engine(self) -> None: + self.engine = EvaluationEngine( + evaluator=self.evaluator, + budget=BudgetLedger(self.split_budgets or []), + default_task=self._task, + db=self.db, + run_constraints=self.run_constraints, + session_id=self._session_id, + vero_home=self._vero_home, + ) def bind(self, session) -> None: from copy import deepcopy @@ -121,21 +71,23 @@ def bind(self, session) -> None: self._vero_home = session.vero_home self.run_constraints = session.evaluation_parameters self._task = session.task - self._budget_map = {(sb.split, sb.dataset_id): sb for sb in self.split_budgets} + self._build_engine() + + @property + def _budget_ledger(self) -> BudgetLedger: + return self.engine.budget + + @property + def _budget_map(self) -> dict[tuple[str, str], SplitBudget]: + """Back-compat view of the budget ledger, keyed (split, dataset_id). + + Returns the ledger's live SplitBudget objects (mutations propagate). 
+ """ + return self.engine.budget.status() def _get_dataset_info(self, dataset_id: str): - """Get dataset info from the store.""" - from vero.core.dataset import DatasetInfo - from vero.core.dataset.store import load_dataset - - sessions_dir = self._vero_home / "sessions" if self._vero_home else None - dataset_cache = self._vero_home / "datasets" if self._vero_home else None - dataset = load_dataset(sessions_dir, dataset_cache, self._session_id, dataset_id) - return DatasetInfo( - id=dataset_id, - splits={split: len(dataset[split]) for split in dataset}, - features={split: list(dataset[split].features) for split in dataset}, - ) + """Get dataset info from the store (delegates to the shared service).""" + return self.engine._get_dataset_info(dataset_id) async def _resolve_commit(self, commit: str) -> str: """Resolve a commit reference to its full hash. @@ -165,93 +117,32 @@ async def _resolve_commit(self, commit: str) -> str: def _get_samples_from_split( self, dataset_id: str, split: str, num_samples: int ) -> list[int] | None: - """Get a list of sample ids from a split. If num_samples is greater than or equal to the size of the split, return None.""" - dataset_info = self._get_dataset_info(dataset_id) - split_size = dataset_info.splits[split] - num_samples = min(num_samples, split_size) - - if num_samples >= split_size: - return None - - sample_ids = list(range(num_samples)) - return sample_ids + """First-N sample ids, or None for the whole split (delegates to the service).""" + return self.engine._get_samples_from_split(dataset_id, split, num_samples) def _validate_and_count_samples( self, dataset_id: str, split: str, sample_ids: list[int] | None = None ) -> int: - """Validate and count the number of samples in a split. 
If sample_ids is None, return the size of the split.""" - - dataset_info = self._get_dataset_info(dataset_id) - split_size = dataset_info.splits[split] - - # If None, the full split is being evaluated - if sample_ids is None: - return split_size - - # Validate that the sample ids are within the range of the split - invalid_sample_ids = [] - for sample_id in sample_ids: - if sample_id < 0 or sample_id >= split_size: - invalid_sample_ids.append(sample_id) - - if len(invalid_sample_ids) > 0: - raise ValueError( - f"The provided sample ids are outside the range of the split [0, {split_size - 1}]: {invalid_sample_ids}" - ) - - return len(sample_ids) + """Validate + count samples (delegates to the service).""" + return self.engine._validate_and_count_samples(dataset_id, split, sample_ids) def _validate_split_access(self, dataset_id: str, split: str) -> None: """Validate that the split and dataset combination is allowed.""" - - if (split, dataset_id) not in self._budget_map: - allowed_keys = list(self._budget_map.keys()) - raise InvalidSplitError( - f"No split budget found for the combination (dataset_id={dataset_id}, split={split}) either because it does not exist or because it is not allowed. Allowed combinations: {allowed_keys}" - ) + self._budget_ledger.validate(dataset_id, split) def _check_budget( self, dataset_id: str, split: str, requested_num_samples: int - ) -> str: + ) -> None: """Check that the budget allows for the requested number of samples.""" - - # Check if this split and dataset combination is allowed - self._validate_split_access(dataset_id, split) - budget = self._budget_map[(split, dataset_id)] - - # Determine if we have enough runs left - if not budget.has_run_budget(): - raise ExperimentBudgetExceeded( - f"No runs left for the {split} split of the {dataset_id} dataset." 
- ) - - # Check against remaining sample budget - if not budget.has_sample_budget(requested_num_samples): - raise ExperimentBudgetExceeded( - f"Requested {requested_num_samples} samples for the {split} split of the {dataset_id} dataset, but the remaining sample budget only allows for {budget.remaining_sample_budget} samples." - ) - - # Check against max samples per run constraint - if budget.exceeds_per_run_budget(requested_num_samples): - raise ExperimentBudgetExceeded( - f"Requested {requested_num_samples} samples for the {split} split of the {dataset_id} dataset, but only {budget.max_samples_per_run} are allowed per run." - ) + self._budget_ledger.check(dataset_id, split, requested_num_samples) def _update_budget(self, dataset_id: str, split: str, num_samples: int) -> str: - """Update the remaining budget for a given dataset and split and return a message about the update.""" - - self._validate_split_access(dataset_id, split) - budget = self._budget_map[(split, dataset_id)] + """Decrement the budget for a given dataset and split; return a status message.""" + budget = self._budget_ledger.record(dataset_id, split, num_samples) info = "" - - # Update the remaining budget - budget.decrement_sample_budget(num_samples) if budget.total_sample_budget is not None: info += f"Used {num_samples} samples from the total {budget.total_sample_budget} sample budget. Remaining sample budget: {budget.remaining_sample_budget}. " - - # Update the remaining runs - budget.decrement_run_budget() if budget.remaining_run_budget is not None: info += f"Used 1 run from the total {budget.total_run_budget} run budget. Remaining runs: {budget.remaining_run_budget}" @@ -263,20 +154,18 @@ async def _evaluate_commit( dataset_id: str, split: str, sample_ids: list[int] | None = None, - add_to_db: bool = True, ) -> Experiment: - """Evaluate a version of the codebase specified by a Git commit on a subset of a dataset.""" + """Run one evaluation via the shared EvaluationEngine. 
+ Uses ``admin=True`` so the service does not meter the budget — this tool + owns budgeting via ``_check_budget``/``_update_budget`` (check-before, + decrement-after) to preserve its existing semantics. + """ + req = EvalRequest( + dataset_id=dataset_id, split=split, commit=commit, sample_ids=sample_ids + ) try: - return await self.evaluator.evaluate( - commit=commit, - dataset_id=dataset_id, - split=split, - task=self._task, - sample_ids=sample_ids, - db=self.db if add_to_db else None, - evaluation_parameters=self.run_constraints, - ) + return await self.engine.evaluate(req, admin=True) except ExperimentRunFailedError as e: if e.returncode >= 3: self.on_fatal(str(e)) @@ -295,8 +184,7 @@ async def check_remaining_experiment_budget( Returns: A string containing the remaining budget for the given dataset and split. """ - self._validate_split_access(dataset_id, split) - budget = self._budget_map[(split, dataset_id)] + budget = self._budget_ledger.get(dataset_id, split) info = "" if budget.total_sample_budget is not None: @@ -355,7 +243,6 @@ async def evaluate_commit( dataset_id=dataset_id, split=split, sample_ids=sample_ids, - add_to_db=True, ) except Exception as e: raise e @@ -385,7 +272,9 @@ async def evaluate_commit_on_all_splits( """ accessible_splits = [ - split for (split, ds_id) in self._budget_map.keys() if ds_id == dataset_id + split + for (split, ds_id) in self._budget_ledger.status().keys() + if ds_id == dataset_id ] logger.info( @@ -403,7 +292,7 @@ async def evaluate_commit_on_all_splits( for split in accessible_splits: full_split_size = self._validate_and_count_samples(dataset_id, split) - budget = self._budget_map.get((split, dataset_id)) + budget = self._budget_ledger.get(dataset_id, split) # Cap samples to remaining budget if needed requested_num_samples = full_split_size @@ -436,7 +325,6 @@ async def evaluate_commit_on_all_splits( dataset_id=dataset_id, split=split, sample_ids=sample_ids, - add_to_db=True, ) except Exception as e: results[split] = e 
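Reviewer note: the check-before, decrement-after budgeting that `_check_budget` and `_update_budget` now delegate to the ledger can be sketched as below. This is a minimal illustration under stated assumptions, not the code in `vero.core.budget`: `SketchBudget`, `SketchLedger`, and `BudgetExceeded` are hypothetical stand-ins, and the real `BudgetLedger` may differ (for example in persistence, per-run caps, and the async `reserve` path).

```python
from dataclasses import dataclass
from typing import Optional


class BudgetExceeded(Exception):
    """Hypothetical stand-in for vero's ExperimentBudgetExceeded."""


@dataclass
class SketchBudget:
    """Hypothetical, simplified analogue of SplitBudget; None means unlimited."""

    remaining_samples: Optional[int] = None
    remaining_runs: Optional[int] = None


class SketchLedger:
    """Hypothetical ledger keyed by (split, dataset_id)."""

    def __init__(self, budgets):
        self._budgets = dict(budgets)

    def get(self, dataset_id, split):
        # Unknown pairs are rejected, playing the role of the allowlist check.
        if (split, dataset_id) not in self._budgets:
            raise KeyError(f"no budget for dataset_id={dataset_id!r}, split={split!r}")
        return self._budgets[(split, dataset_id)]

    def check(self, dataset_id, split, num_samples):
        # Read-only: raises if the request does not fit, never mutates state.
        budget = self.get(dataset_id, split)
        if budget.remaining_runs is not None and budget.remaining_runs <= 0:
            raise BudgetExceeded("no runs left")
        if budget.remaining_samples is not None and num_samples > budget.remaining_samples:
            raise BudgetExceeded("not enough samples left")

    def record(self, dataset_id, split, num_samples):
        # Called after the evaluation ran: spend the samples and one run.
        budget = self.get(dataset_id, split)
        if budget.remaining_samples is not None:
            budget.remaining_samples -= num_samples
        if budget.remaining_runs is not None:
            budget.remaining_runs -= 1
        return budget


ledger = SketchLedger({("dev", "ds1"): SketchBudget(remaining_samples=100, remaining_runs=3)})
ledger.check("ds1", "dev", 30)           # check before the evaluation
after = ledger.record("ds1", "dev", 30)  # decrement after it
print(after)
```

Keeping `check` free of side effects means a rejected request costs nothing, which matches what the ledger tests in this PR assert for `reserve`.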
diff --git a/vero/tests/test_budget.py b/vero/tests/test_budget.py new file mode 100644 index 0000000..c0d5047 --- /dev/null +++ b/vero/tests/test_budget.py @@ -0,0 +1,94 @@ +"""Tests for BudgetLedger (vero.core.budget).""" + +import json + +import pytest + +from vero.core.budget import BudgetLedger, SplitBudget +from vero.exceptions import ExperimentBudgetExceeded, InvalidSplitError + + +def _ledger(**kwargs): + return BudgetLedger( + [ + SplitBudget( + split="dev", dataset_id="ds1", total_sample_budget=100, total_run_budget=3 + ) + ], + **kwargs, + ) + + +class TestAllowlist: + def test_validate_allows_configured_pair(self): + _ledger().validate("ds1", "dev") # no raise + + def test_validate_rejects_unknown_pair(self): + with pytest.raises(InvalidSplitError): + _ledger().validate("ds1", "test") + with pytest.raises(InvalidSplitError): + _ledger().validate("other", "dev") + + +class TestCheck: + def test_check_passes_within_budget(self): + _ledger().check("ds1", "dev", 50) + + def test_check_rejects_over_sample_budget(self): + with pytest.raises(ExperimentBudgetExceeded): + _ledger().check("ds1", "dev", 101) + + def test_check_rejects_no_runs_left(self): + led = BudgetLedger([SplitBudget(split="dev", dataset_id="ds1", total_run_budget=1)]) + led.record("ds1", "dev", 0) # consume the one run + with pytest.raises(ExperimentBudgetExceeded): + led.check("ds1", "dev", 0) + + def test_check_rejects_over_per_run(self): + led = BudgetLedger( + [SplitBudget(split="dev", dataset_id="ds1", total_sample_budget=100, max_samples_per_run=10)] + ) + with pytest.raises(ExperimentBudgetExceeded): + led.check("ds1", "dev", 11) + + +class TestRecord: + def test_record_decrements(self): + led = _ledger() + b = led.record("ds1", "dev", 30) + assert b.remaining_sample_budget == 70 + assert b.remaining_run_budget == 2 + + +class TestReserve: + @pytest.mark.asyncio + async def test_reserve_checks_then_decrements(self): + led = _ledger() + b = await led.reserve("ds1", "dev", 40) + assert 
b.remaining_sample_budget == 60 + assert b.remaining_run_budget == 2 + + @pytest.mark.asyncio + async def test_reserve_rejects_without_decrementing(self): + led = _ledger() + with pytest.raises(ExperimentBudgetExceeded): + await led.reserve("ds1", "dev", 101) + # rejected request costs nothing + assert led.get("ds1", "dev").remaining_sample_budget == 100 + assert led.get("ds1", "dev").remaining_run_budget == 3 + + +class TestPersistence: + def test_flush_writes_durable_json(self, tmp_path): + path = tmp_path / "ledger.json" + led = _ledger(persist_path=path) + led.record("ds1", "dev", 25) + data = json.loads(path.read_text()) + entry = next(e for e in data if e["split"] == "dev" and e["dataset_id"] == "ds1") + assert entry["remaining_sample_budget"] == 75 + assert entry["remaining_run_budget"] == 2 + + def test_no_flush_when_in_memory(self, tmp_path): + led = _ledger() # persist_path=None + led.record("ds1", "dev", 25) # no file written, no error + assert not list(tmp_path.iterdir()) diff --git a/vero/tests/test_dataset_viewer.py b/vero/tests/test_dataset_viewer.py index 044f569..632ffb4 100644 --- a/vero/tests/test_dataset_viewer.py +++ b/vero/tests/test_dataset_viewer.py @@ -3,18 +3,12 @@ from __future__ import annotations import json -import tempfile -from pathlib import Path -from unittest.mock import MagicMock import pytest from datasets import Dataset, DatasetDict from vero.core.dataset import ( - DatasetInfo, DefaultSplitNames, - SplitAccess, default_split_accesses, - get_non_viewable_splits, ) from vero.core.dataset.store import save_dataset from vero.policy import Session diff --git a/vero/tests/test_e2e_optimization.py b/vero/tests/test_e2e_optimization.py index e242bab..121525b 100644 --- a/vero/tests/test_e2e_optimization.py +++ b/vero/tests/test_e2e_optimization.py @@ -49,7 +49,7 @@ async def test_matmul_kernel_evaluates(workspace): """Naive kernel evaluates correctly — all samples produce valid scores.""" kernel_dir, task_dir, dataset_path, vero_home 
= workspace - from vero.evaluator import run_evaluation + from vero.evaluation.evaluator import run_evaluation result = await run_evaluation( project_path=kernel_dir, @@ -77,7 +77,7 @@ async def test_kernel_change_changes_score(workspace): """Modifying kernel code and re-evaluating produces different scores.""" kernel_dir, task_dir, dataset_path, vero_home = workspace - from vero.evaluator import run_evaluation + from vero.evaluation.evaluator import run_evaluation # Evaluate naive kernel result_v1 = await run_evaluation( @@ -170,7 +170,7 @@ def multiply(a, b): from datasets import DatasetDict from vero.core.dataset.store import save_dataset - from vero.evaluator import Evaluator + from vero.evaluation.evaluator import Evaluator session_id = "test-workspace-eval" ds = DatasetDict.load_from_disk(str(dataset_path)) diff --git a/vero/tests/test_engine.py b/vero/tests/test_engine.py new file mode 100644 index 0000000..789927b --- /dev/null +++ b/vero/tests/test_engine.py @@ -0,0 +1,97 @@ +"""Tests for EvaluationEngine (vero.evaluation.engine) — the shared evaluation core.""" + +from unittest.mock import AsyncMock, MagicMock + +import pytest + +from vero.core.budget import BudgetLedger, SplitBudget +from vero.core.dataset import DatasetInfo +from vero.exceptions import ExperimentBudgetExceeded, InvalidSplitError +from vero.evaluation.engine import EvalRequest, EvaluationEngine + +_DATASET_INFO = DatasetInfo( + id="ds1", splits={"dev": 100, "test": 50}, features={"dev": [], "test": []} +) + + +def _make_service(budgets=None, monkeypatch=None): + evaluator = MagicMock() + evaluator.evaluate = AsyncMock(return_value="EXPERIMENT") # sentinel + svc = EvaluationEngine( + evaluator=evaluator, + budget=BudgetLedger( + budgets + or [SplitBudget(split="dev", dataset_id="ds1", total_sample_budget=100, total_run_budget=3)] + ), + default_task="main", + session_id="s1", + ) + if monkeypatch is not None: + monkeypatch.setattr(svc, "_get_dataset_info", lambda dataset_id: 
_DATASET_INFO) + return svc + + +class TestResolveSamples: + def test_rejects_both_sample_ids_and_num_samples(self, monkeypatch): + svc = _make_service(monkeypatch=monkeypatch) + with pytest.raises(ValueError, match="both sample_ids and num_samples"): + svc.resolve_samples(EvalRequest(dataset_id="ds1", split="dev", sample_ids=[0], num_samples=1)) + + def test_num_samples_first_n(self, monkeypatch): + svc = _make_service(monkeypatch=monkeypatch) + ids, n = svc.resolve_samples(EvalRequest(dataset_id="ds1", split="dev", num_samples=5)) + assert ids == [0, 1, 2, 3, 4] and n == 5 + + def test_num_samples_full_split_is_none(self, monkeypatch): + svc = _make_service(monkeypatch=monkeypatch) + ids, n = svc.resolve_samples(EvalRequest(dataset_id="ds1", split="dev", num_samples=100)) + assert ids is None and n == 100 # None == whole split + + def test_sample_ids_out_of_range_raises(self, monkeypatch): + svc = _make_service(monkeypatch=monkeypatch) + with pytest.raises(ValueError, match="outside the range"): + svc.resolve_samples(EvalRequest(dataset_id="ds1", split="dev", sample_ids=[0, 999])) + + +class TestEvaluate: + @pytest.mark.asyncio + async def test_evaluate_meters_and_runs(self, monkeypatch): + svc = _make_service(monkeypatch=monkeypatch) + exp = await svc.evaluate(EvalRequest(dataset_id="ds1", split="dev", commit="c1", num_samples=10)) + + assert exp == "EXPERIMENT" + svc.evaluator.evaluate.assert_awaited_once() + kwargs = svc.evaluator.evaluate.await_args.kwargs + assert kwargs["commit"] == "c1" and kwargs["split"] == "dev" and kwargs["task"] == "main" + assert kwargs["sample_ids"] == list(range(10)) + # budget metered + assert svc.status()[("dev", "ds1")].remaining_run_budget == 2 + assert svc.status()[("dev", "ds1")].remaining_sample_budget == 90 + + @pytest.mark.asyncio + async def test_evaluate_budget_exhausted_does_not_run(self, monkeypatch): + # 50-sample budget; num_samples=60 caps to 60 (< split size 100) and exceeds it + svc = _make_service( + 
budgets=[SplitBudget(split="dev", dataset_id="ds1", total_sample_budget=50)], + monkeypatch=monkeypatch, + ) + with pytest.raises(ExperimentBudgetExceeded): + await svc.evaluate(EvalRequest(dataset_id="ds1", split="dev", commit="c1", num_samples=60)) + svc.evaluator.evaluate.assert_not_awaited() + + @pytest.mark.asyncio + async def test_evaluate_unknown_split_rejected(self, monkeypatch): + svc = _make_service(monkeypatch=monkeypatch) + with pytest.raises(InvalidSplitError): + await svc.evaluate(EvalRequest(dataset_id="ds1", split="test", commit="c1", num_samples=10)) + + @pytest.mark.asyncio + async def test_admin_bypasses_budget(self, monkeypatch): + svc = _make_service(monkeypatch=monkeypatch) + # 'test' isn't in the agent budget map, but admin may evaluate it + await svc.evaluate( + EvalRequest(dataset_id="ds1", split="test", commit="c1", num_samples=10), admin=True + ) + svc.evaluator.evaluate.assert_awaited_once() + # nothing metered + assert svc.status()[("dev", "ds1")].remaining_run_budget == 3 diff --git a/vero/tests/test_eval_strategy.py b/vero/tests/test_eval_strategy.py new file mode 100644 index 0000000..6bf183e --- /dev/null +++ b/vero/tests/test_eval_strategy.py @@ -0,0 +1,66 @@ +"""Tests for the Evaluator strategy seam (vero.evaluation.strategy).""" + +import contextlib +from unittest.mock import AsyncMock, MagicMock + +import pytest + +from vero.core.db.candidate import Candidate +from vero.core.db.dataset import DatasetSample, DatasetSubset +from vero.core.db.result import SampleResult +from vero.core.db.run import ExperimentRun +from vero.core.evaluation import EvaluationParameters +from vero.core.sessions import get_vero_home_dir, save_sample_result +from vero.evaluation.evaluator import Evaluator + + +def _mock_workspace(): + ws = MagicMock() + ws.name = "repo" + ws.is_dirty = AsyncMock(return_value=False) + + @contextlib.asynccontextmanager + async def _at(commit): + yield + + ws.at = _at + return ws + + +@pytest.mark.asyncio +async def 
test_injected_strategy_produces_results(tmp_path, monkeypatch): + monkeypatch.setenv("VERO_HOME_DIR", str(tmp_path / "vero_home")) + + called = {} + + class FakeStrategy: + async def produce_sample_results(self, *, workspace, params, result_dir): + called["yes"] = True + save_sample_result( + get_vero_home_dir() / "sessions", + params.session_id, + params.result_id, + sample_id=0, + result=SampleResult( + dataset_sample=DatasetSample(sample_id=0, split="test", dataset_id="ds"), + score=1.0, + commit=params.run.candidate.commit, + result_id=params.result_id, + ), + ) + + evaluator = Evaluator(_mock_workspace(), session_id="s", eval_strategy=FakeStrategy()) + params = EvaluationParameters( + run=ExperimentRun( + candidate=Candidate(commit="c1", repo_name="repo"), + dataset_subset=DatasetSubset(split="test", dataset_id="ds", sample_ids=[0]), + ), + session_id="s", + ) + + result = await evaluator.run(params, use_copy=False) + + assert called.get("yes") is True + assert result.sample_results[0].score == 1.0 + # Mode-A staging path was NOT taken (strategy branch); sandbox untouched + evaluator.workspace.sandbox.upload.assert_not_called() diff --git a/vero/tests/test_experiment_runner.py b/vero/tests/test_experiment_runner.py index d8a33fc..f2baac5 100644 --- a/vero/tests/test_experiment_runner.py +++ b/vero/tests/test_experiment_runner.py @@ -1,6 +1,6 @@ """Tests for ExperimentRunnerTool and SplitBudget.""" -from unittest.mock import AsyncMock, MagicMock, patch +from unittest.mock import AsyncMock, MagicMock import pytest @@ -23,14 +23,18 @@ @pytest.fixture(autouse=True) def mock_dataset_info(monkeypatch): - """Mock _get_dataset_info to avoid dataset store dependency in tests.""" - original = ExperimentRunnerTool._get_dataset_info + """Mock _get_dataset_info to avoid dataset store dependency in tests. + + The tool delegates dataset resolution to EvaluationEngine, so patch there + (the tool's own _get_dataset_info also delegates to the service). 
+ """ + from vero.evaluation.engine import EvaluationEngine def patched_get_dataset_info(self, dataset_id): return _DEFAULT_DATASET_INFO monkeypatch.setattr( - ExperimentRunnerTool, "_get_dataset_info", patched_get_dataset_info + EvaluationEngine, "_get_dataset_info", patched_get_dataset_info ) diff --git a/vero/tests/test_external_tasks.py b/vero/tests/test_external_tasks.py index 51a730d..2c60bff 100644 --- a/vero/tests/test_external_tasks.py +++ b/vero/tests/test_external_tasks.py @@ -18,7 +18,7 @@ import pytest -from vero.evaluator import run_evaluation +from vero.evaluation.evaluator import run_evaluation def _init_git(path: Path) -> None: diff --git a/vero/tests/test_harbor_app.py b/vero/tests/test_harbor_app.py new file mode 100644 index 0000000..5b6a503 --- /dev/null +++ b/vero/tests/test_harbor_app.py @@ -0,0 +1,85 @@ +"""Tests for vero.harbor.app — FastAPI routes + agent/admin auth.""" + +from unittest.mock import AsyncMock, MagicMock + +from fastapi.testclient import TestClient + +from vero.exceptions import ExperimentBudgetExceeded +from vero.harbor.app import create_app +from vero.harbor.auth import check_admin, generate_token, read_admin_token, write_admin_token +from vero.harbor.protocol import EvalSummary, StatusSummary +from vero.harbor.server import SubmitDisabledError + +TOKEN = "secret-admin-token" + + +def _client(sidecar=None, verifier=None): + sidecar = sidecar or MagicMock() + verifier = verifier or MagicMock() + return TestClient(create_app(sidecar=sidecar, verifier=verifier, admin_token=TOKEN)) + + +class TestAuthHelpers: + def test_token_roundtrip_and_perms(self, tmp_path): + tok = generate_token() + p = write_admin_token(tmp_path / "t", tok) + assert read_admin_token(p) == tok + assert (p.stat().st_mode & 0o777) == 0o600 + + def test_check_admin(self): + assert check_admin(f"Bearer {TOKEN}", TOKEN) is True + assert check_admin("Bearer wrong", TOKEN) is False + assert check_admin(None, TOKEN) is False + assert check_admin(TOKEN, TOKEN) is 
False # missing "Bearer " + + +class TestAgentEndpoints: + def test_eval(self): + sidecar = MagicMock() + sidecar.evaluate = AsyncMock( + return_value=EvalSummary( + commit="c1", split="train", dataset_id="ds", n_samples=2, + mean_score=0.5, result_path="/r", budget_remaining=None, + ) + ) + r = _client(sidecar=sidecar).post( + "/eval", json={"dataset_id": "ds", "split": "train", "num_samples": 2} + ) + assert r.status_code == 200 + assert r.json()["mean_score"] == 0.5 + assert sidecar.evaluate.await_args.kwargs["admin"] is False + + def test_status(self): + sidecar = MagicMock() + sidecar.status = MagicMock( + return_value=StatusSummary(submit_enabled=True, splits=[{"split": "train"}]) + ) + r = _client(sidecar=sidecar).get("/status") + assert r.status_code == 200 and r.json()["submit_enabled"] is True + + def test_submit_disabled_maps_to_409(self): + sidecar = MagicMock() + sidecar.submit = AsyncMock(side_effect=SubmitDisabledError("disabled")) + r = _client(sidecar=sidecar).post("/submit", json={"commit": "c1"}) + assert r.status_code == 409 + + def test_budget_exceeded_maps_to_429(self): + sidecar = MagicMock() + sidecar.evaluate = AsyncMock(side_effect=ExperimentBudgetExceeded("no budget")) + r = _client(sidecar=sidecar).post("/eval", json={"dataset_id": "ds", "split": "train"}) + assert r.status_code == 429 + + +class TestAdminEndpoint: + def test_finalize_requires_token(self): + verifier = MagicMock() + verifier.finalize = AsyncMock(return_value={"reward": 1.0}) + client = _client(verifier=verifier) + + assert client.post("/finalize").status_code == 403 # no token + assert client.post("/finalize", headers={"Authorization": "Bearer wrong"}).status_code == 403 + verifier.finalize.assert_not_awaited() + + r = client.post("/finalize", headers={"Authorization": f"Bearer {TOKEN}"}) + assert r.status_code == 200 and r.json() == {"reward": 1.0} + verifier.finalize.assert_awaited_once() diff --git a/vero/tests/test_harbor_build.py b/vero/tests/test_harbor_build.py 
new file mode 100644 index 0000000..999589b --- /dev/null +++ b/vero/tests/test_harbor_build.py @@ -0,0 +1,131 @@ +"""Unit test for the `vero harbor build` compiler: a BuildConfig compiles to a +well-formed Harbor task directory whose ServeConfig validates and whose rendered +task.toml / compose / scripts parse. No Docker (that's the e2e).""" + +from __future__ import annotations + +import json +import subprocess +import tomllib +from pathlib import Path + +import pytest +import yaml + +from vero.harbor.build import BuildConfig, compile_task +from vero.harbor.serve import ServeConfig + + +def _stub_vero(root: Path) -> Path: + """A minimal stand-in for the vero source tree (compiler just copies it).""" + d = root / "vero-src" + (d / "src" / "vero").mkdir(parents=True) + (d / "pyproject.toml").write_text("[project]\nname='scale-vero'\nversion='0'\n") + (d / "README.md").write_text("vero\n") + (d / "src" / "vero" / "__init__.py").write_text("") + return d + + +def _agent_repo(root: Path) -> Path: + d = root / "agent" + (d / "src" / "gsm8k_agent").mkdir(parents=True) + (d / "pyproject.toml").write_text( + "[project]\nname='gsm8k-agent'\nversion='0'\n\n" + '[tool.uv.sources]\nscale-vero = { path = "../../", editable = true }\n' + ) + (d / "src" / "gsm8k_agent" / "agent.py").write_text("X = 1\n") + subprocess.run(["git", "init", "-q"], cwd=d, check=True) + subprocess.run(["git", "add", "-A"], cwd=d, check=True) + subprocess.run( + ["git", "-c", "user.name=t", "-c", "user.email=t@t", "commit", "-qm", "i"], + cwd=d, check=True, + ) + return d + + +def _dataset(root: Path) -> Path: + from datasets import Dataset, DatasetDict + + ds = DatasetDict({ + "validation": Dataset.from_dict({"question": ["1+1?"], "answer": ["#### 2"]}), + "test": Dataset.from_dict({"question": ["2+2?"], "answer": ["#### 4"]}), + }) + p = root / "ds" + ds.save_to_disk(str(p)) + return p + + +@pytest.fixture +def built(tmp_path): + config = BuildConfig( + name="vero/gsm8k-opt", + description="optimize 
gsm8k", + agent_repo=str(_agent_repo(tmp_path)), + mode="A", + task="gsm8k", + task_module="gsm8k_agent.vero_tasks", + dataset=str(_dataset(tmp_path)), + splits=[ + {"split": "validation", "access": "non_viewable"}, + {"split": "test", "access": "no_access"}, + ], + budgets=[{"split": "validation", "total_run_budget": 5}], + reward_mode="auto_best", + selection_split="validation", + targets=[{"split": "test", "reward_key": "reward"}], + read_only_paths=["src/gsm8k_agent/vero_tasks"], + secrets=["OPENAI_API_KEY"], + ) + out = compile_task(config, tmp_path / "task", vero_root=_stub_vero(tmp_path)) + return out + + +def test_structure(built): + for rel in [ + "task.toml", "instruction.md", + "environment/docker-compose.yaml", "environment/Dockerfile", + "environment/sidecar/Dockerfile", "environment/sidecar/serve.json", + "environment/main/seed.sh", "environment/vero/pyproject.toml", + "environment/agent-baseline/.git", "environment/agent-seed/.git", + "environment/sidecar/vero_home", "tests/test.sh", "solution/solve.sh", + ]: + assert (built / rel).exists(), f"missing {rel}" + + +def test_serve_config_validates(built): + cfg = ServeConfig.from_file(built / "environment" / "sidecar" / "serve.json") + assert cfg.repo_path == "/opt/agent-baseline" + assert cfg.agent_repo_path == "/work/agent" + assert cfg.task == "gsm8k" + assert cfg.dataset_id # registered + assert cfg.base_commit # baseline sha recorded for auto_best exclusion + assert cfg.targets and cfg.targets[0].split == "test" + assert cfg.budgets[0]["dataset_id"] == cfg.dataset_id + + +def test_rendered_files_parse(built): + tomllib.loads((built / "task.toml").read_text()) # valid TOML + compose = yaml.safe_load((built / "environment/docker-compose.yaml").read_text()) + assert "eval-sidecar" in compose["services"] + assert compose["services"]["main"]["depends_on"]["eval-sidecar"]["condition"] == "service_healthy" + # secret reaches the sidecar only, via host-resolved compose interpolation + assert 
compose["services"]["eval-sidecar"]["environment"]["OPENAI_API_KEY"] == "${OPENAI_API_KEY}" + assert "OPENAI_API_KEY" not in compose["services"]["main"].get("environment", {}) + + +def test_vero_source_path_rewritten(built): + pyproject = (built / "environment/agent-baseline/pyproject.toml").read_text() + assert 'path = "/opt/vero"' in pyproject + assert "../../" not in pyproject + + +def test_baseline_sha_shared(built): + def head(p): + return subprocess.run( + ["git", "-C", str(built / p), "rev-parse", "HEAD"], + capture_output=True, text=True, check=True, + ).stdout.strip() + + assert head("environment/agent-baseline") == head("environment/agent-seed") + cfg = json.loads((built / "environment/sidecar/serve.json").read_text()) + assert cfg["base_commit"] == head("environment/agent-baseline") diff --git a/vero/tests/test_harbor_cli.py b/vero/tests/test_harbor_cli.py new file mode 100644 index 0000000..afab16c --- /dev/null +++ b/vero/tests/test_harbor_cli.py @@ -0,0 +1,82 @@ +"""Tests for vero.harbor.cli — the agent/verifier CLI clients (mocked httpx).""" + +import json + +from click.testing import CliRunner + +from vero.harbor.cli import harbor + + +class _Resp: + def __init__(self, status_code, data): + self.status_code = status_code + self._data = data + self.text = json.dumps(data) + + def json(self): + return self._data + + +def _patch_httpx(monkeypatch, resp, capture): + import httpx + + def fake_request(method, url, *, json=None, headers=None, timeout=None): + capture.update(method=method, url=url, json=json, headers=headers) + return resp + + monkeypatch.setattr(httpx, "request", fake_request) + + +def test_eval_posts_and_prints(monkeypatch): + monkeypatch.setenv("VERO_EVAL_URL", "http://sidecar:8000") + cap: dict = {} + _patch_httpx(monkeypatch, _Resp(200, {"mean_score": 0.5}), cap) + + result = CliRunner().invoke( + harbor, ["eval", "--dataset-id", "ds", "--split", "train", "--num-samples", "3"] + ) + assert result.exit_code == 0 + assert cap["method"] 
== "POST" and cap["url"].endswith("/eval") + assert cap["json"] == {"dataset_id": "ds", "split": "train", "num_samples": 3} + assert json.loads(result.output)["mean_score"] == 0.5 + + +def test_eval_error_status_raises(monkeypatch): + monkeypatch.setenv("VERO_EVAL_URL", "http://sidecar:8000") + _patch_httpx(monkeypatch, _Resp(429, {"error": "no budget"}), {}) + result = CliRunner().invoke(harbor, ["eval", "--dataset-id", "ds", "--split", "train"]) + assert result.exit_code != 0 + assert "429" in result.output + + +def test_eval_missing_url_errors(): + result = CliRunner(env={"VERO_EVAL_URL": ""}).invoke( + harbor, ["eval", "--dataset-id", "ds", "--split", "train"] + ) + assert result.exit_code != 0 + assert "VERO_EVAL_URL" in result.output + + +def test_status_get(monkeypatch): + monkeypatch.setenv("VERO_EVAL_URL", "http://sidecar:8000") + cap: dict = {} + _patch_httpx(monkeypatch, _Resp(200, {"submit_enabled": True}), cap) + result = CliRunner().invoke(harbor, ["status"]) + assert result.exit_code == 0 and cap["method"] == "GET" and cap["url"].endswith("/status") + + +def test_finalize_uses_token_and_writes_reward(monkeypatch, tmp_path): + monkeypatch.setenv("VERO_EVAL_URL", "http://sidecar:8000") + token_file = tmp_path / "tok" + token_file.write_text("T0KEN") + out = tmp_path / "reward.json" + cap: dict = {} + _patch_httpx(monkeypatch, _Resp(200, {"reward": 1.0}), cap) + + result = CliRunner().invoke( + harbor, ["finalize", "--token-file", str(token_file), "--output", str(out)] + ) + assert result.exit_code == 0 + assert cap["url"].endswith("/finalize") + assert cap["headers"]["Authorization"] == "Bearer T0KEN" + assert json.loads(out.read_text()) == {"reward": 1.0} diff --git a/vero/tests/test_harbor_dataset.py b/vero/tests/test_harbor_dataset.py new file mode 100644 index 0000000..82caca2 --- /dev/null +++ b/vero/tests/test_harbor_dataset.py @@ -0,0 +1,49 @@ +"""Tests for vero.harbor.dataset — partition compile + local task enumeration.""" + +import pytest + 
+from vero.harbor.dataset import ( + build_harbor_dataset, + enumerate_local_task_names, + validate_partition, +) + + +def _make_task_dir(root, name): + d = root / name + d.mkdir(parents=True) + (d / "task.toml").write_text("[task]\nname='x'\n") + return d + + +class TestBuildDataset: + def test_partition_to_datasetdict(self): + ds = build_harbor_dataset({"train": ["a", "b"], "test": ["c"]}) + assert set(ds.keys()) == {"train", "test"} + assert ds["train"]["task_name"] == ["a", "b"] + assert ds["test"]["task_name"] == ["c"] + + def test_empty_partition_raises(self): + with pytest.raises(ValueError): + build_harbor_dataset({}) + + +class TestEnumerateLocal: + def test_dataset_dir_of_tasks(self, tmp_path): + _make_task_dir(tmp_path, "task_b") + _make_task_dir(tmp_path, "task_a") + (tmp_path / "not_a_task").mkdir() # no task.toml -> excluded + assert enumerate_local_task_names(tmp_path) == ["task_a", "task_b"] + + def test_single_task_dir(self, tmp_path): + d = _make_task_dir(tmp_path, "solo") + assert enumerate_local_task_names(d) == ["solo"] + + +class TestValidatePartition: + def test_ok_when_subset(self): + validate_partition({"train": ["a"], "test": ["b"]}, ["a", "b", "c"]) + + def test_unknown_names_raise(self): + with pytest.raises(ValueError, match="not found"): + validate_partition({"test": ["a", "zzz"]}, ["a", "b"]) diff --git a/vero/tests/test_harbor_protocol.py b/vero/tests/test_harbor_protocol.py new file mode 100644 index 0000000..9a079e1 --- /dev/null +++ b/vero/tests/test_harbor_protocol.py @@ -0,0 +1,86 @@ +"""Tests for vero.harbor.protocol — sidecar wire types + redaction/summary.""" + +from vero.core.budget import SplitBudget +from vero.core.dataset.base import SplitAccess, SplitAccessLevel +from vero.core.db.candidate import Candidate +from vero.core.db.dataset import DatasetSubset +from vero.core.db.result import ( + ExperimentResult, + ExperimentResultStatus, + SampleResult, +) +from vero.core.db.run import ExperimentRun +from 
vero.harbor.protocol import ( + build_status, + summarize_experiment, + tier_for_split, +) + + +def _experiment(scores: list[float]) -> "object": + from vero.core.db.database import Experiment + from vero.core.db.dataset import DatasetSample + + run = ExperimentRun( + candidate=Candidate(commit="abc123", repo_name="r"), + dataset_subset=DatasetSubset(split="validation", dataset_id="ds1"), + ) + sample_results = { + i: SampleResult( + dataset_sample=DatasetSample(sample_id=i, split="validation", dataset_id="ds1"), + score=s, + ) + for i, s in enumerate(scores) + } + result = ExperimentResult( + run_id=run.id, status=ExperimentResultStatus.SUCCESS, sample_results=sample_results + ) + return Experiment(run=run, result=result) + + +class TestTier: + def test_listed_split_tier(self): + accesses = [SplitAccess.no_access("test"), SplitAccess.non_viewable("validation")] + assert tier_for_split("test", accesses) == SplitAccessLevel.no_access + assert tier_for_split("validation", accesses) == SplitAccessLevel.non_viewable + + def test_unlisted_defaults_viewable(self): + assert tier_for_split("train", []) == SplitAccessLevel.viewable + + +class TestSummarize: + def test_aggregate_only_no_per_sample(self): + exp = _experiment([1.0, 0.0, 1.0]) + summary = summarize_experiment(exp, result_path="/x/y") + assert summary.commit == "abc123" + assert summary.split == "validation" + assert summary.dataset_id == "ds1" + assert summary.n_samples == 3 + assert summary.mean_score is not None + # no per-sample field exists on the summary at all + assert not any("sample" in k for k in summary.to_dict() if k != "n_samples") + + def test_budget_serialized(self): + exp = _experiment([1.0]) + b = SplitBudget(split="validation", dataset_id="ds1", total_run_budget=5) + d = summarize_experiment(exp, result_path=None, budget_remaining=b).to_dict() + assert d["budget_remaining"]["remaining_run_budget"] == 5 + assert d["result_path"] is None + + +class TestBuildStatus: + def 
test_lists_budgeted_splits_with_tier(self): + budget = { + ("train", "ds1"): SplitBudget(split="train", dataset_id="ds1", total_run_budget=10), + ("validation", "ds1"): SplitBudget(split="validation", dataset_id="ds1", total_run_budget=3), + } + accesses = [SplitAccess.non_viewable("validation")] # train defaults viewable + status = build_status(submit_enabled=True, budget=budget, split_accesses=accesses) + + assert status.submit_enabled is True + by_split = {s["split"]: s for s in status.splits} + assert by_split["train"]["tier"] == str(SplitAccessLevel.viewable) + assert by_split["train"]["agent_evaluable"] is True + assert by_split["validation"]["tier"] == str(SplitAccessLevel.non_viewable) + assert by_split["validation"]["agent_evaluable"] is True + assert by_split["validation"]["remaining_run_budget"] == 3 diff --git a/vero/tests/test_harbor_runner.py b/vero/tests/test_harbor_runner.py new file mode 100644 index 0000000..15df89e --- /dev/null +++ b/vero/tests/test_harbor_runner.py @@ -0,0 +1,128 @@ +"""Tests for vero.harbor.runner.HarborRunner — command build, collation, resume.""" + +import json +from pathlib import Path +from unittest.mock import AsyncMock, MagicMock + +import pytest + +from vero.core.db.candidate import Candidate +from vero.core.db.dataset import DatasetSample, DatasetSubset +from vero.core.db.result import SampleResult +from vero.core.db.run import ExperimentRun +from vero.core.evaluation import EvaluationParameters +from vero.core.sessions import ( + get_vero_home_dir, + load_all_sample_results, + save_sample_result, +) +from vero.harbor.config import HarborConfig +from vero.harbor.runner import HarborRunner + + +def _runner(reward_key=None, task_source="org/ds@1"): + return HarborRunner( + HarborConfig( + task_source=task_source, + agent_import_path="pkg.mod:Agent", + model="anthropic/x", + environment="modal", + reward_key=reward_key, + ) + ) + + +def _params(): + return EvaluationParameters( + run=ExperimentRun( + 
candidate=Candidate(commit="c1", repo_name="r"), + dataset_subset=DatasetSubset(split="test", dataset_id="ds", sample_ids=[0, 1]), + ), + session_id="s", + ) + + +def _write_trial(jobs_dir: Path, trial: str, task_name: str, rewards: dict): + # Real harbor layout: <jobs_dir>/<run>/<trial>/result.json, plus a job-level + # <jobs_dir>/<run>/result.json summary (no task_name) that collation must skip. + run = jobs_dir / "2026-01-01__00-00-00" + d = run / trial + d.mkdir(parents=True, exist_ok=True) + (run / "result.json").write_text(json.dumps({"job": "summary"})) # job-level, no task_name + (d / "result.json").write_text( + json.dumps({"task_name": task_name, "trial_name": trial, "verifier_result": {"rewards": rewards}}) + ) + + +class TestBuildCommand: + def test_registry_source_and_flags(self): + cmd = _runner()._build_command("/wt", _params(), ["t0", "t1"], Path("/jobs")) + assert cmd[:5] == ["uv", "run", "--project", "/wt", "harbor"] + assert "-d" in cmd and "org/ds@1" in cmd + assert "--agent-import-path" in cmd and "pkg.mod:Agent" in cmd + assert cmd.count("-i") == 2 and "t0" in cmd and "t1" in cmd + assert "-m" in cmd and "-e" in cmd and "--jobs-dir" in cmd + + def test_local_source(self, tmp_path): + cmd = _runner(task_source=str(tmp_path))._build_command("/wt", _params(), ["t0"], Path("/jobs")) + assert "-p" in cmd and str(tmp_path) in cmd + assert "-d" not in cmd + + +class TestExtractReward: + def test_priority_pass_then_reward_then_mean(self): + r = _runner() + assert r._extract_reward({"pass": 1.0, "reward": 0.0}) == 1.0 + assert r._extract_reward({"reward": 0.7}) == 0.7 + assert r._extract_reward({"a": 0.2, "b": 0.4}) == pytest.approx(0.3) + + def test_reward_key_override(self): + assert _runner(reward_key="acc")._extract_reward({"acc": 0.9, "pass": 0.0}) == 0.9 + + +class TestCollate: + @pytest.mark.asyncio + async def test_produces_results_and_marks_missing(self, tmp_path, monkeypatch): + monkeypatch.setenv("VERO_HOME_DIR", str(tmp_path / "vh")) + runner = _runner() + params = _params() +
result_dir = tmp_path / "result" + jobs = result_dir / "jobs" + _write_trial(jobs, "trial0", "t0", {"pass": 1.0, "extra": 0.5}) + # no trial for t1 + + monkeypatch.setattr(runner, "_task_names_for", lambda p: [(0, "t0"), (1, "t1")]) + runner._run_harbor = AsyncMock() # fixtures already present; don't shell out + + ws = MagicMock(project_path="/wt") + await runner.produce_sample_results(workspace=ws, params=params, result_dir=result_dir) + + results = load_all_sample_results(get_vero_home_dir() / "sessions", "s", params.result_id) + assert results[0].score == 1.0 + assert results[0].metrics["extra"] == 0.5 + assert results[1].error is not None # missing trial -> error sample + + @pytest.mark.asyncio + async def test_resume_only_runs_pending(self, tmp_path, monkeypatch): + monkeypatch.setenv("VERO_HOME_DIR", str(tmp_path / "vh")) + runner = _runner() + params = _params() + result_dir = tmp_path / "result" + + # sample 0 already done + save_sample_result( + get_vero_home_dir() / "sessions", "s", params.result_id, sample_id=0, + result=SampleResult( + dataset_sample=DatasetSample(sample_id=0, split="test", dataset_id="ds"), + score=1.0, commit="c1", result_id=params.result_id, + ), + ) + _write_trial(result_dir / "jobs", "trial1", "t1", {"pass": 0.0}) + monkeypatch.setattr(runner, "_task_names_for", lambda p: [(0, "t0"), (1, "t1")]) + runner._run_harbor = AsyncMock() + + ws = MagicMock(project_path="/wt") + await runner.produce_sample_results(workspace=ws, params=params, result_dir=result_dir) + + # only the pending task name was passed to harbor + assert runner._run_harbor.await_args.args[2] == ["t1"] diff --git a/vero/tests/test_harbor_serve.py b/vero/tests/test_harbor_serve.py new file mode 100644 index 0000000..edbbbca --- /dev/null +++ b/vero/tests/test_harbor_serve.py @@ -0,0 +1,143 @@ +"""Integration test for vero.harbor.serve — assemble the sidecar/verifier from a +ServeConfig and run a real (deterministic, no-LLM) Mode-A eval + finalize. 
+ +Reuses the external-task project pattern: a trivial agent + a separate task project, +scored deterministically. Validates that `build_components` produces a working engine, +and that a real eval flows into verifier selection + scoring. +""" + +from __future__ import annotations + +import subprocess +import textwrap +from pathlib import Path + +import pytest + +from vero.core.dataset.store import resolve_and_save_dataset +from vero.evaluation.engine import EvalRequest +from vero.harbor.serve import ServeConfig, build_components + + +def _git(path: Path, *args: str) -> str: + return subprocess.run( + ["git", "-c", "user.name=t", "-c", "user.email=t@t", *args], + cwd=path, capture_output=True, check=True, text=True, + ).stdout.strip() + + +def _create_agent(root: Path) -> tuple[Path, str]: + d = root / "my-agent" + (d / "src" / "my_agent").mkdir(parents=True) + (d / "pyproject.toml").write_text(textwrap.dedent("""\ + [project] + name = "my-agent" + version = "0.1.0" + requires-python = ">=3.11" + [build-system] + requires = ["hatchling"] + build-backend = "hatchling.build" + [tool.hatch.build.targets.wheel] + packages = ["src/my_agent"] + """)) + (d / "src" / "my_agent" / "__init__.py").write_text('def solve(q): return "42"\n') + _git(d, "init") + _git(d, "add", ".") + _git(d, "commit", "-m", "init") + return d, _git(d, "rev-parse", "HEAD") + + +def _create_task_project(root: Path, vero_path: Path) -> Path: + d = root / "my-eval-tasks" + vt = d / "src" / "my_eval_tasks" / "vero_tasks" + vt.mkdir(parents=True) + (d / "pyproject.toml").write_text(textwrap.dedent(f"""\ + [project] + name = "my-eval-tasks" + version = "0.1.0" + requires-python = ">=3.11" + dependencies = ["scale-vero[optimize]"] + [build-system] + requires = ["hatchling"] + build-backend = "hatchling.build" + [tool.hatch.build.targets.wheel] + packages = ["src/my_eval_tasks"] + [tool.uv.sources] + scale-vero = {{ path = "{vero_path}", editable = true }} + """)) + (vt / "__init__.py").write_text("from 
my_eval_tasks.vero_tasks import math_task # noqa\n") + (vt / "math_task.py").write_text(textwrap.dedent("""\ + from my_agent import solve + from vero.core.db.result import TaskOutput, TaskResult + from vero.core.evaluation import EvaluationParameters + from vero.core.task import create_task + math_task = create_task("math") + @math_task.inference() + async def run_inference(task, evaluation_parameters): + return TaskOutput(output=solve(task["question"])) + @math_task.evaluation() + async def evaluate(task, output, evaluation_parameters): + return TaskResult(output=output.output, score=1.0 if output.output == task["expected"] else 0.0) + """)) + subprocess.run(["uv", "sync"], cwd=d, capture_output=True, check=True) + return d + + +@pytest.fixture +def fixture(tmp_path, monkeypatch): + from vero.core.constants import PACKAGE_DIR + from datasets import Dataset, DatasetDict + + vh = tmp_path / "vero_home" + (vh / "sessions").mkdir(parents=True) + (vh / "datasets").mkdir(parents=True) + monkeypatch.setenv("VERO_HOME_DIR", str(vh)) + + agent_dir, head = _create_agent(tmp_path) + task_dir = _create_task_project(tmp_path, PACKAGE_DIR) + ds = DatasetDict({"test": Dataset.from_dict( + {"question": ["6*7?", "2+2?"], "expected": ["42", "4"]})}) + ds_path = tmp_path / "ds" + ds.save_to_disk(str(ds_path)) + dataset_id = resolve_and_save_dataset(str(ds_path), vh / "sessions", vh / "datasets", "sess") + return agent_dir, head, task_dir, dataset_id, tmp_path + + +def _serve_config(agent_dir, head, task_dir, dataset_id, tmp) -> ServeConfig: + return ServeConfig( + repo_path=str(agent_dir), + agent_repo_path=str(agent_dir), + session_id="sess", + dataset_id=dataset_id, + split_accesses=[{"split": "test", "access": "non_viewable"}], + budgets=[{"split": "test", "dataset_id": dataset_id, "total_run_budget": 5}], + task="math", + task_project=str(task_dir), + task_module="my_eval_tasks.vero_tasks", + reward_mode="auto_best", + selection_split="test", + targets=[{"task": "math", 
"dataset_id": dataset_id, "split": "test", "reward_key": "reward", "sample_ids": [0]}], + agent_volume=str(tmp / "agent_vol"), + admin_volume=str(tmp / "admin_vol"), + admin_token_path=str(tmp / "admin_vol" / "token"), + timeout=300, + ) + + +@pytest.mark.asyncio +async def test_serve_assembles_and_evaluates_and_finalizes(fixture): + agent_dir, head, task_dir, dataset_id, tmp = fixture + config = _serve_config(agent_dir, head, task_dir, dataset_id, tmp) + + sidecar, verifier, token = await build_components(config) + assert token and (tmp / "admin_vol" / "token").read_text() == token + + # real eval (no LLM): sample 0 expects "42", agent solves -> "42" -> score 1.0 + exp = await sidecar.engine.evaluate( + EvalRequest(dataset_id=dataset_id, split="test", commit=head, sample_ids=[0]) + ) + assert exp.result.sample_results[0].score == 1.0 + + # verifier selects the (only) candidate on "test" and scores it on the test target + rewards = await verifier.finalize() + assert rewards["reward"] == 1.0 diff --git a/vero/tests/test_harbor_server.py b/vero/tests/test_harbor_server.py new file mode 100644 index 0000000..16986a4 --- /dev/null +++ b/vero/tests/test_harbor_server.py @@ -0,0 +1,124 @@ +"""Tests for vero.harbor.server.EvaluationSidecar — handlers, tier-routing, submit.""" + +import json +from unittest.mock import AsyncMock, MagicMock + +import pytest + +from vero.core.budget import BudgetLedger, SplitBudget +from vero.core.dataset.base import SplitAccess +from vero.core.db.candidate import Candidate +from vero.core.db.dataset import DatasetSample, DatasetSubset +from vero.core.db.database import Experiment +from vero.core.db.result import ( + ExperimentResult, + ExperimentResultStatus, + SampleResult, +) +from vero.core.db.run import ExperimentRun +from vero.harbor.server import EvaluationSidecar, SubmitDisabledError +from vero.evaluation.engine import EvalRequest + + +def _experiment(split: str, commit: str = "abcdef123456") -> Experiment: + run = ExperimentRun( + 
candidate=Candidate(commit=commit, repo_name="r"), + dataset_subset=DatasetSubset(split=split, dataset_id="ds1"), + ) + sample_results = { + i: SampleResult( + dataset_sample=DatasetSample(sample_id=i, split=split, dataset_id="ds1"), + score=float(i % 2), + feedback=f"Expected: secret-{i}", # label-bearing: must NOT reach agent on partial + ) + for i in range(3) + } + return Experiment( + run=run, + result=ExperimentResult( + run_id=run.id, status=ExperimentResultStatus.SUCCESS, sample_results=sample_results + ), + ) + + +def _sidecar(tmp_path, *, split, submit_enabled=False): + engine = MagicMock() + engine.evaluate = AsyncMock(return_value=_experiment(split)) + engine.budget = BudgetLedger( + [SplitBudget(split=split, dataset_id="ds1", total_run_budget=5, total_sample_budget=100)] + ) + sidecar = EvaluationSidecar( + engine=engine, + split_accesses=[SplitAccess.non_viewable("validation"), SplitAccess.no_access("test")], + agent_repo_path=tmp_path / "agent_repo", + agent_volume=tmp_path / "agent_vol", + admin_volume=tmp_path / "admin_vol", + submit_enabled=submit_enabled, + ) + # Stub the git transfer (integration-tested separately); pin the sha. 
+ sidecar._transfer_commit = AsyncMock(return_value="abcdef123456") + return sidecar + + +class TestRouting: + @pytest.mark.asyncio + async def test_visible_split_writes_full_per_sample(self, tmp_path): + sidecar = _sidecar(tmp_path, split="train") # train defaults to viewable + summary = await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="train")) + + dest = tmp_path / "agent_vol" / "results" / "train__abcdef123456" + assert (dest / "summary.json").exists() + assert {(dest / f"{i}.json").exists() for i in range(3)} == {True} + assert summary.result_path == str(dest) + assert summary.n_samples == 3 + + @pytest.mark.asyncio + async def test_partial_split_writes_summary_only_no_labels(self, tmp_path): + sidecar = _sidecar(tmp_path, split="validation") # non_viewable -> partial + summary = await sidecar.evaluate(EvalRequest(dataset_id="ds1", split="validation")) + + dest = tmp_path / "agent_vol" / "results" / "validation__abcdef123456" + assert (dest / "summary.json").exists() + # NO per-sample files -> the label-bearing feedback never reaches the agent + assert not list(dest.glob("[0-9]*.json")) + blob = (dest / "summary.json").read_text() + assert "secret-" not in blob + assert summary.result_path == str(dest) + + @pytest.mark.asyncio + async def test_admin_eval_writes_nothing_to_agent_volume(self, tmp_path): + sidecar = _sidecar(tmp_path, split="test") # no_access; admin only + summary = await sidecar.evaluate( + EvalRequest(dataset_id="ds1", split="test"), admin=True + ) + assert not (tmp_path / "agent_vol").exists() or not list( + (tmp_path / "agent_vol").rglob("*.json") + ) + assert summary.result_path is None + # admin call bypasses metering + assert summary.budget_remaining is None + + +class TestSubmit: + @pytest.mark.asyncio + async def test_submit_records_nomination(self, tmp_path): + sidecar = _sidecar(tmp_path, split="train", submit_enabled=True) + out = await sidecar.submit(commit="deadbeef") + rec = json.loads((tmp_path / "admin_vol" / 
"submission.json").read_text()) + assert rec["commit"] == "abcdef123456" # the transferred sha + assert out["submitted_commit"] == "abcdef123456" + + @pytest.mark.asyncio + async def test_submit_disabled_raises(self, tmp_path): + sidecar = _sidecar(tmp_path, split="train", submit_enabled=False) + with pytest.raises(SubmitDisabledError): + await sidecar.submit(commit="x") + + +class TestStatus: + def test_status_reports_submit_and_splits(self, tmp_path): + sidecar = _sidecar(tmp_path, split="train", submit_enabled=True) + status = sidecar.status() + assert status.submit_enabled is True + assert status.splits[0]["split"] == "train" + assert status.splits[0]["remaining_run_budget"] == 5 diff --git a/vero/tests/test_harbor_transfer.py b/vero/tests/test_harbor_transfer.py new file mode 100644 index 0000000..00afae5 --- /dev/null +++ b/vero/tests/test_harbor_transfer.py @@ -0,0 +1,85 @@ +"""Integration test for EvaluationSidecar._transfer_commit (real git repos). + +Validates that a commit is fetched from the (untrusted) mounted agent repo into +the sidecar's own repo and resolved to its sha — the one server.py piece +that can't be unit-tested with mocks. 
+""" + +import subprocess +from pathlib import Path +from unittest.mock import MagicMock + +import pytest + +from vero.harbor.server import EvaluationSidecar +from vero.sandbox import LocalSandbox +from vero.workspace.git import GitWorkspace + + +def _git(cwd: Path, *args: str) -> str: + return subprocess.run( + ["git", "-c", "user.email=t@t", "-c", "user.name=t", *args], + cwd=cwd, + check=True, + capture_output=True, + text=True, + ).stdout.strip() + + +def _init_repo(path: Path, content: str) -> str: + path.mkdir(parents=True, exist_ok=True) + _git(path, "init", "-q") + (path / "f.txt").write_text(content) + _git(path, "add", "f.txt") + _git(path, "commit", "-q", "-m", "c") + return _git(path, "rev-parse", "HEAD") + + +async def _sidecar_for(sidecar_repo: Path, agent_repo: Path, tmp_path: Path): + sandbox = await LocalSandbox.create(root=tmp_path) + workspace = await GitWorkspace.from_path(sandbox, str(sidecar_repo)) + engine = MagicMock() + engine.evaluator.workspace = workspace + return EvaluationSidecar( + engine=engine, + split_accesses=[], + agent_repo_path=agent_repo, + agent_volume=tmp_path / "av", + admin_volume=tmp_path / "adv", + ) + + +@pytest.mark.asyncio +async def test_transfer_fetches_agent_head_into_sidecar_repo(tmp_path): + agent_repo = tmp_path / "agent" + sidecar_repo = tmp_path / "sidecar" + agent_head = _init_repo(agent_repo, "agent work") + _init_repo(sidecar_repo, "sidecar base") + + sidecar = await _sidecar_for(sidecar_repo, agent_repo, tmp_path) + sha = await sidecar._transfer_commit(None) # default = agent HEAD + + assert sha == agent_head + # the fetched commit object now lives in the sidecar's own repo (tamper-evident copy) + assert ( + subprocess.run( + ["git", "-C", str(sidecar_repo), "cat-file", "-e", sha], capture_output=True + ).returncode + == 0 + ) + + +@pytest.mark.asyncio +async def test_transfer_explicit_ref(tmp_path): + agent_repo = tmp_path / "agent" + sidecar_repo = tmp_path / "sidecar" + _init_repo(agent_repo, "first") + 
# a second commit; transfer the first by explicit sha + first = _git(agent_repo, "rev-parse", "HEAD") + (agent_repo / "f.txt").write_text("second") + _git(agent_repo, "commit", "-aqm", "second") + _init_repo(sidecar_repo, "sidecar base") + + sidecar = await _sidecar_for(sidecar_repo, agent_repo, tmp_path) + sha = await sidecar._transfer_commit(first) + assert sha == first diff --git a/vero/tests/test_harbor_verifier.py b/vero/tests/test_harbor_verifier.py new file mode 100644 index 0000000..59b878b --- /dev/null +++ b/vero/tests/test_harbor_verifier.py @@ -0,0 +1,112 @@ +"""Tests for vero.harbor.verifier.Verifier — selection + multi-target scoring.""" + +import json +from unittest.mock import AsyncMock, MagicMock + +import pandas as pd +import pytest + +from vero.harbor.verifier import NoCandidateError, VerificationTarget, Verifier + + +def _engine(scores_by_call): + engine = MagicMock() + engine.evaluate_admin = AsyncMock( + side_effect=[MagicMock(result=MagicMock(score=MagicMock(return_value=s))) for s in scores_by_call] + ) + return engine + + +class TestSubmitSelection: + @pytest.mark.asyncio + async def test_finalize_submit_scores_nominated_commit(self, tmp_path): + (tmp_path / "submission.json").write_text(json.dumps({"commit": "deadbeef"})) + engine = _engine([0.8]) + v = Verifier( + engine=engine, + admin_volume=tmp_path, + reward_mode="submit", + targets=[VerificationTarget(task="t", dataset_id="ds1", split="test", reward_key="reward")], + ) + rewards = await v.finalize() + assert rewards == {"reward": 0.8} + assert engine.evaluate_admin.await_args.kwargs["commit"] == "deadbeef" + assert engine.evaluate_admin.await_args.kwargs["split"] == "test" + + @pytest.mark.asyncio + async def test_finalize_submit_no_submission_raises(self, tmp_path): + v = Verifier( + engine=_engine([]), + admin_volume=tmp_path, + reward_mode="submit", + targets=[VerificationTarget(task="t", dataset_id="ds1", split="test", reward_key="reward")], + ) + with 
pytest.raises(NoCandidateError): + await v.finalize() + + +class TestMultiTarget: + @pytest.mark.asyncio + async def test_finalize_emits_multiple_reward_keys(self, tmp_path): + (tmp_path / "submission.json").write_text(json.dumps({"commit": "c1"})) + engine = _engine([0.9, 0.4]) + v = Verifier( + engine=engine, + admin_volume=tmp_path, + reward_mode="submit", + targets=[ + VerificationTarget(task="t", dataset_id="ds1", split="test", reward_key="in_domain"), + VerificationTarget(task="t2", dataset_id="ds2", split="test", reward_key="held_out"), + ], + ) + rewards = await v.finalize() + assert rewards == {"in_domain": 0.9, "held_out": 0.4} + assert engine.evaluate_admin.await_count == 2 + + +class TestAutoBestSelection: + @pytest.mark.asyncio + async def test_finalize_auto_best_picks_top_validation_score(self, tmp_path): + engine = _engine([0.95]) + engine.db.get_experiments_df.return_value = pd.DataFrame( + { + "dataset_subset_split": ["validation", "validation", "train"], + "candidate_commit": ["lo", "hi", "ignored"], + "mean_score": [0.5, 0.9, 1.0], + "candidate_created_at": [1, 2, 3], + } + ) + v = Verifier( + engine=engine, + admin_volume=tmp_path, + reward_mode="auto_best", + selection_split="validation", + targets=[VerificationTarget(task="t", dataset_id="ds1", split="test", reward_key="reward")], + ) + rewards = await v.finalize() + assert rewards == {"reward": 0.95} + # selected the highest validation score ("hi"), not the train row + assert engine.evaluate_admin.await_args.kwargs["commit"] == "hi" + + @pytest.mark.asyncio + async def test_finalize_auto_best_excludes_baseline(self, tmp_path): + engine = _engine([0.7]) + engine.db.get_experiments_df.return_value = pd.DataFrame( + { + "dataset_subset_split": ["validation", "validation"], + "candidate_commit": ["base", "agent"], + "mean_score": [0.99, 0.6], + "candidate_created_at": [1, 2], + } + ) + v = Verifier( + engine=engine, + admin_volume=tmp_path, + reward_mode="auto_best", + 
selection_split="validation", + base_commit="base", + targets=[VerificationTarget(task="t", dataset_id="ds1", split="test", reward_key="reward")], + ) + await v.finalize() + # baseline excluded even though it scored higher + assert engine.evaluate_admin.await_args.kwargs["commit"] == "agent" diff --git a/vero/tests/test_isolate_project.py b/vero/tests/test_isolate_project.py index bc94a03..d41071a 100644 --- a/vero/tests/test_isolate_project.py +++ b/vero/tests/test_isolate_project.py @@ -1,11 +1,10 @@ """Tests for project isolation with dependency resolution.""" import subprocess -from pathlib import Path import pytest -from vero.evaluator import _resolve_vero_dependency +from vero.evaluation.evaluator import _resolve_vero_dependency @pytest.fixture diff --git a/vero/tests/test_task.py b/vero/tests/test_task.py index 4d5972e..274067b 100644 --- a/vero/tests/test_task.py +++ b/vero/tests/test_task.py @@ -2,7 +2,6 @@ from __future__ import annotations -import warnings import pytest from pydantic import ValidationError @@ -369,3 +368,46 @@ async def infer(task, evaluation_parameters): params = _make_eval_params() with pytest.raises(RuntimeError, match="No evaluation function"): await t.run(params) + + +# --------------------------------------------------------------------------- +# Label scrubbing (Mode A) +# --------------------------------------------------------------------------- + + +class TestLabelScrubbing: + def test_scrub_inputs_helper(self): + t = create_task("scrub-helper", register=False, label_fields=["answer"]) + # mapping rows have label fields removed + assert t._scrub_inputs({"q": "x", "answer": "y"}) == {"q": "x"} + # non-mapping rows pass through unchanged + assert t._scrub_inputs("notadict") == "notadict" + # no label_fields configured -> no-op + t2 = create_task("scrub-helper-2", register=False) + assert t2._scrub_inputs({"q": "x", "answer": "y"}) == {"q": "x", "answer": "y"} + + @pytest.mark.asyncio + async def 
test_inference_never_sees_labels_scoring_does(self): + t = create_task("scrub-e2e", register=False, label_fields=["answer"]) + seen_by_inference = {} + + @t.load_data() + def load(evaluation_parameters): + return [{"q": "2+2", "answer": "4"}] + + @t.inference() + async def infer(task, evaluation_parameters): + seen_by_inference["keys"] = sorted(task.keys()) + return TaskOutput(output="4") + + @t.evaluation() + async def evaluate(task, output, evaluation_parameters): + # scoring receives the full row, including the label + assert "answer" in task + return TaskResult(score=1.0 if output.output == task["answer"] else 0.0) + + params = _make_eval_params(num_samples=1) + metrics = await t.run(params) + + assert seen_by_inference["keys"] == ["q"] # label stripped from inference + assert metrics["avg_score"] == 1.0 diff --git a/vero/tests/test_task_metrics.py b/vero/tests/test_task_metrics.py index dd13f73..4ae6f93 100644 --- a/vero/tests/test_task_metrics.py +++ b/vero/tests/test_task_metrics.py @@ -6,7 +6,7 @@ import pytest -from vero.evaluator import Evaluator +from vero.evaluation.evaluator import Evaluator from vero.utils.asyncio import SubprocessResult pytestmark = pytest.mark.asyncio @@ -43,8 +43,8 @@ def fake_subprocess(*args, **kwargs): returncode=0, ) - with patch("vero.evaluator.run_subprocess_with_tee", new=AsyncMock(side_effect=fake_subprocess)): - with patch("vero.evaluator.UvRunParameters.from_env", return_value=MagicMock(get_cmd=lambda: ["uv", "run"])): + with patch("vero.evaluation.evaluator.run_subprocess_with_tee", new=AsyncMock(side_effect=fake_subprocess)): + with patch("vero.evaluation.evaluator.UvRunParameters.from_env", return_value=MagicMock(get_cmd=lambda: ["uv", "run"])): result = await evaluator._run_task( Path("/fake/project"), "test_task", params_file ) @@ -59,8 +59,8 @@ async def test_run_task_returns_none_when_no_metrics_file(evaluator, experiment_ def fake_subprocess(*args, **kwargs): return SubprocessResult(args=["fake"], stdout="", 
stderr="", returncode=0) - with patch("vero.evaluator.run_subprocess_with_tee", new=AsyncMock(side_effect=fake_subprocess)): - with patch("vero.evaluator.UvRunParameters.from_env", return_value=MagicMock(get_cmd=lambda: ["uv", "run"])): + with patch("vero.evaluation.evaluator.run_subprocess_with_tee", new=AsyncMock(side_effect=fake_subprocess)): + with patch("vero.evaluation.evaluator.UvRunParameters.from_env", return_value=MagicMock(get_cmd=lambda: ["uv", "run"])): result = await evaluator._run_task( Path("/fake/project"), "test_task", params_file ) @@ -76,8 +76,8 @@ def fake_subprocess(*args, **kwargs): (tmp_path / "metrics.json").write_text("not valid json {{{") return SubprocessResult(args=["fake"], stdout="", stderr="", returncode=0) - with patch("vero.evaluator.run_subprocess_with_tee", new=AsyncMock(side_effect=fake_subprocess)): - with patch("vero.evaluator.UvRunParameters.from_env", return_value=MagicMock(get_cmd=lambda: ["uv", "run"])): + with patch("vero.evaluation.evaluator.run_subprocess_with_tee", new=AsyncMock(side_effect=fake_subprocess)): + with patch("vero.evaluation.evaluator.UvRunParameters.from_env", return_value=MagicMock(get_cmd=lambda: ["uv", "run"])): result = await evaluator._run_task( Path("/fake/project"), "test_task", params_file ) @@ -98,8 +98,8 @@ def fake_subprocess(*args, **kwargs): returncode=0, ) - with patch("vero.evaluator.run_subprocess_with_tee", new=AsyncMock(side_effect=fake_subprocess)): - with patch("vero.evaluator.UvRunParameters.from_env", return_value=MagicMock(get_cmd=lambda: ["uv", "run"])): + with patch("vero.evaluation.evaluator.run_subprocess_with_tee", new=AsyncMock(side_effect=fake_subprocess)): + with patch("vero.evaluation.evaluator.UvRunParameters.from_env", return_value=MagicMock(get_cmd=lambda: ["uv", "run"])): await evaluator._run_task(Path("/fake/project"), "test_task", params_file) assert (tmp_path / "subprocess_stdout.log").read_text() == "some stdout" diff --git a/vero/tests/test_uv_with_editable.py 
b/vero/tests/test_uv_with_editable.py index 8e9b891..07171ae 100644 --- a/vero/tests/test_uv_with_editable.py +++ b/vero/tests/test_uv_with_editable.py @@ -8,7 +8,6 @@ from __future__ import annotations -import asyncio import subprocess import textwrap from pathlib import Path diff --git a/vero/uv.lock b/vero/uv.lock index 27fa2eb..b1b98ac 100644 --- a/vero/uv.lock +++ b/vero/uv.lock @@ -913,6 +913,22 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/c1/ea/53f2148663b321f21b5a606bd5f191517cf40b7072c0497d3c92c4a13b1e/executing-2.2.1-py2.py3-none-any.whl", hash = "sha256:760643d3452b4d777d295bb167ccc74c64a81df23fb5e08eff250c425a4b2017", size = 28317, upload-time = "2025-09-01T09:48:08.5Z" }, ] +[[package]] +name = "fastapi" +version = "0.137.1" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "annotated-doc" }, + { name = "pydantic" }, + { name = "starlette" }, + { name = "typing-extensions" }, + { name = "typing-inspection" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/d5/b1/e5b92c59d2c37817e77c1a8c2fc1f79cdcc04c68253e5406b43e3204cba7/fastapi-0.137.1.tar.gz", hash = "sha256:822360704230d9533d8d9475399613525968aa2f0b5bd2a3ccc9f18c88fd541c", size = 408293, upload-time = "2026-06-15T11:28:20.79Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/da/35/380b9a5922f4340e51c309cde09e5bd32e62f02302971bee30dc15aa0624/fastapi-0.137.1-py3-none-any.whl", hash = "sha256:64f6983c59e45c4b9fdc44e57cb8035c2451ee91ea8e8ec042aca37de7cf6b69", size = 121877, upload-time = "2026-06-15T11:28:19.523Z" }, +] + [[package]] name = "fastjsonschema" version = "2.21.2" @@ -4251,7 +4267,9 @@ dependencies = [ { name = "datasets" }, { name = "pydantic" }, { name = "python-dotenv" }, + { name = "pyyaml" }, { name = "requests" }, + { name = "rich" }, { name = "s3fs" }, { name = "tenacity" }, { name = "toml" }, @@ -4269,6 +4287,12 @@ evaluate = [ { name = "haikunator" }, { name = "rich" }, ] +harbor = [ + { name = "fastapi" 
}, + { name = "httpx" }, + { name = "jinja2" }, + { name = "uvicorn" }, +] jupyter = [ { name = "jupyter" }, ] @@ -4324,8 +4348,11 @@ requires-dist = [ { name = "datasets", specifier = ">=4.3.0" }, { name = "datasets", marker = "extra == 'optimize'", specifier = ">=4.3.0" }, { name = "docker", marker = "extra == 'docker'", specifier = ">=7.1.0" }, + { name = "fastapi", marker = "extra == 'harbor'", specifier = ">=0.110" }, { name = "haikunator", marker = "extra == 'evaluate'", specifier = ">=2.1.0" }, { name = "haikunator", marker = "extra == 'optimize'", specifier = ">=2.1.0" }, + { name = "httpx", marker = "extra == 'harbor'", specifier = ">=0.27" }, + { name = "jinja2", marker = "extra == 'harbor'", specifier = ">=3.1.6" }, { name = "jinja2", marker = "extra == 'optimize'", specifier = ">=3.1.6" }, { name = "jupyter", marker = "extra == 'jupyter'", specifier = ">=1.1.1" }, { name = "jupyterlab", marker = "extra == 'notebook'", specifier = ">=4.5.2" }, @@ -4338,7 +4365,9 @@ requires-dist = [ { name = "pydantic", specifier = ">=2.11.7" }, { name = "pypdf", marker = "extra == 'optimize'", specifier = ">=6.2.0" }, { name = "python-dotenv", specifier = ">=1.2.2" }, + { name = "pyyaml", specifier = ">=6.0" }, { name = "requests", specifier = ">=2.32.5" }, + { name = "rich", specifier = ">=13.9.4" }, { name = "rich", marker = "extra == 'evaluate'", specifier = ">=13.9.4" }, { name = "rich", marker = "extra == 'optimize'", specifier = ">=13.9.4" }, { name = "s3fs", specifier = ">=2025.9.0" }, @@ -4349,10 +4378,11 @@ requires-dist = [ { name = "toml", specifier = ">=0.10.2" }, { name = "tqdm", specifier = ">=4.67.1" }, { name = "trafilatura", marker = "extra == 'optimize'", specifier = ">=2.0.0" }, + { name = "uvicorn", marker = "extra == 'harbor'", specifier = ">=0.27" }, { name = "wandb", marker = "extra == 'wandb'", specifier = ">=0.2.5" }, { name = "wcmatch", marker = "extra == 'optimize'", specifier = ">=10.1" }, ] -provides-extras = ["wandb", "sgp", "docker", 
"claude", "optimize", "jupyter", "kaggle", "evaluate", "plot", "notebook"] +provides-extras = ["wandb", "sgp", "docker", "claude", "harbor", "optimize", "jupyter", "kaggle", "evaluate", "plot", "notebook"] [package.metadata.requires-dev] dev = [