Add Harbor integration: optimization-as-a-Harbor-task #2

Draft
varunursekar wants to merge 1 commit into main from harbor-integration

@varunursekar

Draft for review / play — not ready to merge.

Compiles a vero optimization run into a Harbor task whose agent-under-test is an optimizer: any Harbor agent (Claude Code, an oracle script, …) edits a target repo and spends an evaluation budget, and the reward is the best candidate's score on a hidden split. This makes optimization runs reproducible with plain harbor run and leaderboard-gradeable — the optimizer cannot read hidden labels, modify the scorer, or bypass its budget.
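
To make the budget constraint concrete, here is a minimal sketch of a ledger that refuses an evaluation once the budget is spent. `ToyBudgetLedger`, `BudgetExhausted`, the `charge` method, and the choice of "one unit per evaluated example" are all assumptions for illustration; the `BudgetLedger` in this PR may account for spend differently.

```python
from dataclasses import dataclass, field


class BudgetExhausted(RuntimeError):
    """Raised when a charge would exceed the remaining budget."""


@dataclass
class ToyBudgetLedger:
    """Hypothetical ledger that tracks spent evaluation units."""

    total: int
    spent: int = 0
    history: list = field(default_factory=list)

    @property
    def remaining(self) -> int:
        return self.total - self.spent

    def charge(self, commit: str, units: int) -> None:
        # Validate before mutating state so a refused charge costs nothing.
        if units <= 0:
            raise ValueError("units must be positive")
        if units > self.remaining:
            raise BudgetExhausted(
                f"need {units} units, only {self.remaining} left"
            )
        self.spent += units
        self.history.append((commit, units))


ledger = ToyBudgetLedger(total=10)
ledger.charge("abc123", 6)
try:
    ledger.charge("def456", 6)  # would overspend, so it is refused
except BudgetExhausted as exc:
    print(f"refused: {exc}")
print(ledger.remaining)
```

Keeping the check and the mutation in one method, on the sidecar side rather than in the agent's container, is one way to make the budget hard to bypass.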

Where to start

  • docs/harbor/architecture.md — what it is, the compose topology, and the leaderboard-integrity model.
  • docs/harbor/tutorial.md — build + run a task end to end.
  • examples/gsm8k-agent (Mode A) and examples/gaia-optimization (Mode B).

What's here

  • vero core: 3-tier split visibility (visible / non_viewable / no_access), resumable staged evaluation with label scrubbing, BudgetLedger, and the EvaluationEngine / Evaluator split with an injectable eval-strategy seam.
  • harbor sidecar (vero/harbor/): EvaluationSidecar (commit transfer + tier-gated result write-routing), Verifier (commit selection + hidden-split scoring), per-trial root:600 admin-token auth, and a FastAPI surface (/eval, /submit, /status for the agent; /finalize for the verifier).
  • Two modes: Mode A (vero runs inference + scoring) and Mode B (delegated to a nested harbor run, e.g. on Modal).
  • vero harbor CLI: build | run | serve | eval | submit | status | finalize + a compiler that renders a runnable Harbor task (Docker Compose: optimizer workbench + eval sidecar + volumes).

Validation

End-to-end runs producing real rewards: Mode A on gsm8k (reward 1.0 via an OpenAI-compatible gateway), Mode B with a custom agent on Modal (reward 1.0), and a Claude-Code optimizer tuning a terminus agent on a GAIA 5+5 subset (accuracy 0.4 on the hidden split). Harbor unit suite: 103 passing.

Notes for review

  • Squashed to one commit; full development history is on harbor-substrate-v2.
  • The live container/Modal e2e of the committed gaia-optimization example hasn't been re-run (same machinery as the validated runs; costs Modal credits). Easy to run via the example README.

🤖 Generated with Claude Code

Compile a vero optimization run into a Harbor task whose agent-under-test is an
optimizer: any Harbor agent (Claude Code, an oracle script, ...) edits a target
repo and spends an evaluation budget, and the reward is the best candidate's
score on a hidden split. This makes optimization runs reproducible with plain
`harbor run`, and leaderboard-gradeable — the optimizer cannot read hidden
labels, modify the scorer, or bypass its budget.

vero core:
- 3-tier split visibility (visible / non_viewable / no_access).
- Resumable staged evaluation (inference vs scoring) with label scrubbing.
- BudgetLedger (core/budget.py) and the EvaluationEngine / Evaluator split with
  an injectable eval-strategy seam (evaluation/).
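
A toy sketch of staged evaluation with label scrubbing, under stated assumptions: labels live in fields named `label` or `answer`, predictions are cached by example id so an interrupted run can resume, and scoring is exact match. None of these names or choices are taken from the vero code.

```python
from typing import Callable

LABEL_KEYS = ("label", "answer")  # assumed names of label-bearing fields


def scrub(example: dict) -> dict:
    """Return a copy of the example without label fields."""
    return {k: v for k, v in example.items() if k not in LABEL_KEYS}


def run_inference(examples, predict: Callable[[dict], str], cache: dict) -> dict:
    """Stage 1: predict on scrubbed inputs, skipping ids already cached."""
    for ex in examples:
        if ex["id"] not in cache:  # resumability: finished ids are not redone
            cache[ex["id"]] = predict(scrub(ex))
    return cache


def run_scoring(examples, predictions: dict) -> float:
    """Stage 2: exact-match accuracy of cached predictions against labels."""
    hits = sum(predictions[ex["id"]] == ex["label"] for ex in examples)
    return hits / len(examples)


examples = [
    {"id": "q1", "question": "2 + 2", "label": "4"},
    {"id": "q2", "question": "3 * 3", "label": "9"},
]
answers = {"2 + 2": "4", "3 * 3": "6"}  # second toy answer is deliberately wrong
seen = []


def toy_predict(example: dict) -> str:
    seen.append(example["id"])
    assert "label" not in example  # inference never receives the label
    return answers[example["question"]]


cache = {"q1": "4"}  # pretend q1 finished before an interruption
predictions = run_inference(examples, toy_predict, cache)
score = run_scoring(examples, predictions)
print(seen, score)
```

Splitting the stages this way means only the scoring stage ever touches labels, which is what lets the inference stage run in a less trusted place.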

harbor (vero/harbor/):
- EvaluationSidecar: commit transfer from the untrusted agent repo + tier-gated
  result write-routing; Verifier: commit selection (submit | auto_best) +
  hidden-split scoring; per-trial root:600 admin-token auth; a FastAPI surface
  (/eval, /submit, /status for the agent; /finalize for the verifier).
- Two evaluation modes: Mode A (vero runs inference + scoring) and Mode B
  (delegated to a nested `harbor run`, e.g. on Modal) via HarborConfig/HarborRunner.
- `vero harbor` CLI: build | run | serve | eval | submit | status | finalize, and a
  compiler (build/) that renders a runnable Harbor task (Docker Compose: optimizer
  workbench + eval sidecar + volumes), baking the dataset/scorer/baseline repo and
  the sidecar's ServeConfig.
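
The two commit-selection modes can be sketched as follows. This is an interpretation of the names only: `submit` is assumed to take the commit the agent explicitly submitted, and `auto_best` is assumed to take the highest-scoring commit among those evaluated during the run. `select_commit` and its arguments are hypothetical.

```python
from typing import Optional


def select_commit(
    mode: str, evaluated: dict[str, float], submitted: Optional[str]
) -> str:
    """Pick the commit that would be scored on the hidden split.

    evaluated maps a commit SHA to the score it earned during the run.
    """
    if mode == "submit":
        if submitted is None:
            raise ValueError("mode 'submit' requires an explicit submission")
        return submitted
    if mode == "auto_best":
        if not evaluated:
            raise ValueError("no evaluated commits to choose from")
        # max() keeps the first maximal key, so ties go to the earliest entry.
        return max(evaluated, key=evaluated.get)
    raise ValueError(f"unknown selection mode: {mode!r}")


scores = {"a1b2": 0.4, "c3d4": 0.7, "e5f6": 0.7}
print(select_commit("auto_best", scores, submitted=None))
print(select_commit("submit", scores, submitted="a1b2"))
```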

Docs and examples:
- docs/harbor/architecture.md and docs/harbor/tutorial.md; a Harbor section in README.
- examples/gsm8k-agent (Mode A) and examples/gaia-optimization (Mode B: a terminus-2
  agent with an editable prompt, scored on GAIA via nested harbor on Modal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@socket-security

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | pypi/fastapi@0.137.1 | 100 | 100 | 100 | 100 | 100 |
