Add Harbor integration: optimization-as-a-Harbor-task by varunursekar · Pull Request #2 · scaleapi/vero

varunursekar · 2026-06-16T17:21:44Z

Draft for review / play — not ready to merge.

Compiles a vero optimization run into a Harbor task whose agent-under-test is an optimizer: any Harbor agent (Claude Code, an oracle script, …) edits a target repo and spends an evaluation budget, and the reward is the best candidate's score on a hidden split. This makes optimization runs reproducible with plain `harbor run` and leaderboard-gradeable: the optimizer cannot read hidden labels, modify the scorer, or bypass its budget.
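As a rough illustration of that contract (not vero's actual API), the sketch below models an optimizer that spends a fixed evaluation budget on a visible split and is rewarded on a hidden one. Every name here (`ToyLedger`, `run_trial`, the two scoring callables) is hypothetical, and choosing the best candidate by its visible score is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


class BudgetExhausted(RuntimeError):
    """Raised when the optimizer asks for more evaluations than it was granted."""


@dataclass
class ToyLedger:
    """Hypothetical stand-in for a budget ledger: counts evaluations spent."""
    limit: int
    spent: int = 0

    def charge(self, cost: int = 1) -> None:
        if self.spent + cost > self.limit:
            raise BudgetExhausted(f"budget {self.limit} exhausted")
        self.spent += cost


def run_trial(
    candidates: Iterable[str],
    visible_score: Callable[[str], float],
    hidden_score: Callable[[str], float],
    budget: int,
) -> float:
    """Evaluate candidates until the budget runs out, then reward the best one.

    The optimizer only observes visible_score; hidden_score is applied once,
    on the verifier side, to the selected candidate.
    """
    ledger = ToyLedger(limit=budget)
    best, best_visible = None, float("-inf")
    for cand in candidates:
        try:
            ledger.charge()
        except BudgetExhausted:
            break  # remaining candidates are never evaluated
        score = visible_score(cand)
        if score > best_visible:
            best, best_visible = cand, score
    if best is None:
        return 0.0  # assumption: no evaluated candidate means zero reward
    return hidden_score(best)
```

With a budget of 2 and three candidates, the third candidate is never scored, so the reward comes from whichever of the first two looked best on the visible split.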

Where to start

- docs/harbor/architecture.md: what it is, the compose topology, and the leaderboard-integrity model.
- docs/harbor/tutorial.md: build + run a task end to end.
- examples/gsm8k-agent (Mode A) and examples/gaia-optimization (Mode B).

What's here

- vero core: 3-tier split visibility (visible / non_viewable / no_access), resumable staged evaluation with label scrubbing, BudgetLedger, and the EvaluationEngine / Evaluator split with an injectable eval-strategy seam.
- harbor sidecar (vero/harbor/): EvaluationSidecar (commit transfer + tier-gated result write-routing), Verifier (commit selection + hidden-split scoring), per-trial root:600 admin-token auth, and a FastAPI surface (/eval, /submit, /status for the agent; /finalize for the verifier).
- Two modes: Mode A (vero runs inference + scoring) and Mode B (delegated to a nested `harbor run`, e.g. on Modal).
- `vero harbor` CLI: build | run | serve | eval | submit | status | finalize + a compiler that renders a runnable Harbor task (Docker Compose: optimizer workbench + eval sidecar + volumes).
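To make the three visibility tiers and label scrubbing concrete, here is a minimal sketch of how results might be gated per tier. It is not taken from the vero code: the function names, the row shape, and the exact semantics (full rows for visible, aggregate only for non_viewable, refusal for no_access) are assumptions made for illustration.

```python
from enum import Enum
from statistics import mean
from typing import Any


class Tier(Enum):
    VISIBLE = "visible"            # assumed: agent may read per-example results
    NON_VIEWABLE = "non_viewable"  # assumed: agent may evaluate, sees aggregate only
    NO_ACCESS = "no_access"        # assumed: reserved for the verifier


def scrub_labels(example: dict[str, Any], label_keys=("label", "answer")) -> dict[str, Any]:
    """Return a copy of the example with ground-truth fields removed."""
    return {k: v for k, v in example.items() if k not in label_keys}


def agent_view(tier: Tier, rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Build what the agent is allowed to read back for one evaluation.

    Each row is assumed to look like {"input": ..., "label": ..., "score": float}.
    """
    if tier is Tier.NO_ACCESS:
        raise PermissionError("agent may not evaluate on a no_access split")
    summary: dict[str, Any] = {"mean_score": mean(r["score"] for r in rows)}
    if tier is Tier.VISIBLE:
        summary["rows"] = [dict(r) for r in rows]  # copies, so callers cannot mutate the originals
    return summary
```

In this sketch `scrub_labels` would be applied to examples before they reach the inference stage, while `agent_view` decides what is written back to the agent after scoring.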

Validation

End-to-end runs producing real rewards: Mode A on gsm8k (reward 1.0 via an OpenAI-compatible gateway), Mode B with a custom agent on Modal (reward 1.0), and a Claude-Code optimizer tuning a terminus agent on a GAIA 5+5 subset (accuracy 0.4 on the hidden split). Harbor unit suite: 103 passing.

Notes for review

- Squashed to one commit; full development history is on harbor-substrate-v2.
- The live container/Modal e2e of the committed gaia-optimization example hasn't been re-run (same machinery as the validated runs; costs Modal credits). Easy to run via the example README.

🤖 Generated with Claude Code

Compile a vero optimization run into a Harbor task whose agent-under-test is an optimizer: any Harbor agent (Claude Code, an oracle script, ...) edits a target repo and spends an evaluation budget, and the reward is the best candidate's score on a hidden split. This makes optimization runs reproducible with plain `harbor run`, and leaderboard-gradeable: the optimizer cannot read hidden labels, modify the scorer, or bypass its budget.

vero core:
- 3-tier split visibility (visible / non_viewable / no_access).
- Resumable staged evaluation (inference vs scoring) with label scrubbing.
- BudgetLedger (core/budget.py) and the EvaluationEngine / Evaluator split with an injectable eval-strategy seam (evaluation/).

harbor (vero/harbor/):
- EvaluationSidecar: commit transfer from the untrusted agent repo + tier-gated result write-routing; Verifier: commit selection (submit | auto_best) + hidden-split scoring; per-trial root:600 admin-token auth; a FastAPI surface (/eval, /submit, /status for the agent; /finalize for the verifier).
- Two evaluation modes: Mode A (vero runs inference + scoring) and Mode B (delegated to a nested `harbor run`, e.g. on Modal) via HarborConfig/HarborRunner.
- `vero harbor` CLI: build | run | serve | eval | submit | status | finalize, and a compiler (build/) that renders a runnable Harbor task (Docker Compose: optimizer workbench + eval sidecar + volumes), baking the dataset/scorer/baseline repo and the sidecar's ServeConfig.

Docs and examples:
- docs/harbor/architecture.md and docs/harbor/tutorial.md; a Harbor section in README.
- examples/gsm8k-agent (Mode A) and examples/gaia-optimization (Mode B: a terminus-2 agent with an editable prompt, scored on GAIA via nested harbor on Modal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
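The two commit-selection modes named above (submit | auto_best) could work roughly as follows. This is a hedged sketch, not the Verifier's real implementation: `select_commit`, its arguments, and the rule that auto_best ranks commits by a previously recorded score with ties going to the earliest commit are all assumptions.

```python
from typing import Optional, Sequence, Tuple


def select_commit(
    mode: str,
    evaluated: Sequence[Tuple[str, float]],
    submitted: Optional[str] = None,
) -> str:
    """Pick the commit that will be scored on the hidden split.

    evaluated: (commit_sha, recorded_score) pairs in evaluation order.
    submitted: the commit the agent explicitly submitted, if any.
    """
    if mode == "submit":
        if submitted is None:
            raise ValueError("mode 'submit' requires an explicit submission")
        return submitted
    if mode == "auto_best":
        if not evaluated:
            raise ValueError("no evaluated commits to choose from")
        # max() keeps the first maximal item, so ties go to the earliest commit.
        return max(evaluated, key=lambda item: item[1])[0]
    raise ValueError(f"unknown selection mode: {mode!r}")
```

For example, `select_commit("auto_best", [("aaa", 0.4), ("bbb", 0.7), ("ccc", 0.7)])` returns `"bbb"`, while `select_commit("submit", [], submitted="aaa")` returns `"aaa"` regardless of recorded scores.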

socket-security · 2026-06-16T17:22:32Z

Review the following changes in direct dependencies.

Package: pypi/fastapi@0.137.1

varunursekar wants to merge 1 commit into main from harbor-integration