Add Harbor integration: optimization-as-a-Harbor-task#2
Draft
varunursekar wants to merge 1 commit into
Draft
Conversation
Compile a vero optimization run into a Harbor task whose agent-under-test is an optimizer: any Harbor agent (Claude Code, an oracle script, ...) edits a target repo and spends an evaluation budget, and the reward is the best candidate's score on a hidden split. This makes optimization runs reproducible with plain `harbor run`, and leaderboard-gradeable — the optimizer cannot read hidden labels, modify the scorer, or bypass its budget. vero core: - 3-tier split visibility (visible / non_viewable / no_access). - Resumable staged evaluation (inference vs scoring) with label scrubbing. - BudgetLedger (core/budget.py) and the EvaluationEngine / Evaluator split with an injectable eval-strategy seam (evaluation/). harbor (vero/harbor/): - EvaluationSidecar: commit transfer from the untrusted agent repo + tier-gated result write-routing; Verifier: commit selection (submit | auto_best) + hidden -split scoring; per-trial root:600 admin-token auth; a FastAPI surface (/eval, /submit, /status for the agent; /finalize for the verifier). - Two evaluation modes: Mode A (vero runs inference + scoring) and Mode B (delegated to a nested `harbor run`, e.g. on Modal) via HarborConfig/HarborRunner. - `vero harbor` CLI: build | run | serve | eval | submit | status | finalize, and a compiler (build/) that renders a runnable Harbor task (Docker Compose: optimizer workbench + eval sidecar + volumes), baking the dataset/scorer/baseline repo and the sidecar's ServeConfig. Docs and examples: - docs/harbor/architecture.md and docs/harbor/tutorial.md; a Harbor section in README. - examples/gsm8k-agent (Mode A) and examples/gaia-optimization (Mode B: a terminus-2 agent with an editable prompt, scored on GAIA via nested harbor on Modal). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Draft for review / play — not ready to merge.
Compiles a vero optimization run into a Harbor task whose agent-under-test is an optimizer: any Harbor agent (Claude Code, an oracle script, …) edits a target repo and spends an evaluation budget, and the reward is the best candidate's score on a hidden split. This makes optimization runs reproducible with plain
harbor runand leaderboard-gradeable — the optimizer cannot read hidden labels, modify the scorer, or bypass its budget.Where to start
docs/harbor/architecture.md— what it is, the compose topology, and the leaderboard-integrity model.docs/harbor/tutorial.md— build + run a task end to end.examples/gsm8k-agent(Mode A) andexamples/gaia-optimization(Mode B).What's here
BudgetLedger, and theEvaluationEngine/Evaluatorsplit with an injectable eval-strategy seam.vero/harbor/):EvaluationSidecar(commit transfer + tier-gated result write-routing),Verifier(commit selection + hidden-split scoring), per-trialroot:600admin-token auth, and a FastAPI surface (/eval,/submit,/statusfor the agent;/finalizefor the verifier).harbor run, e.g. on Modal).vero harborCLI:build | run | serve | eval | submit | status | finalize+ a compiler that renders a runnable Harbor task (Docker Compose: optimizer workbench + eval sidecar + volumes).Validation
End-to-end runs producing real rewards: Mode A on gsm8k (reward 1.0 via an OpenAI-compatible gateway), Mode B with a custom agent on Modal (reward 1.0), and a Claude-Code optimizer tuning a terminus agent on a GAIA 5+5 subset (accuracy 0.4 on the hidden split). Harbor unit suite: 103 passing.
Notes for review
harbor-substrate-v2.gaia-optimizationexample hasn't been re-run (same machinery as the validated runs; costs Modal credits). Easy to run via the example README.🤖 Generated with Claude Code