Feat: hud-python sdk v6 #421
Merged
200 commits
d2e8a8d
drop v4 task compatibility
jdchawla29 2e937d4
Merge pull request #403 from hud-evals/codex/drop-v4-support
jdchawla29 4f37307
Align docs with v4 support removal
jdchawla29 0f19561
Fix public docs SDK imports
jdchawla29 66afab0
v5 regression tests
jdchawla29 a2bb01c
Decouple agent native tools from environment primitives
jdchawla29 63165d0
tool updates
jdchawla29 7bfbdc6
Merge pull request #407 from hud-evals/decouple-agent-tools
jdchawla29 5866ecb
Merge remote-tracking branch 'origin/main' into v6
jdchawla29 a43c5c0
small gitignore
lorenss-m eeef96f
refactor OpenAIChatAgent into openai_compatible package
jdchawla29 9366a1a
agent updates
jdchawla29 18306c5
Merge pull request #413 from hud-evals/j/agent-updates
jdchawla29 2330b9e
add AGENTS.md
jdchawla29 9442766
add init env
lorenss-m 1576dee
Merge branch 'v6' of https://github.com/hud-evals/hud-python into l/v…
lorenss-m 78f5461
simplify fx
lorenss-m 0c84a19
fx
lorenss-m 9d7696f
Update .gitignore
jdchawla29 c8d3a1b
Isolate agent run state
jdchawla29 89c3138
add more testing guideliens to AGENTS.md
jdchawla29 4f494b0
fix imports
jdchawla29 93ce003
simplify tool name handling
jdchawla29 70de8c7
agent context with top-level system prompt and citation options
jdchawla29 f92e707
tests updated
jdchawla29 e1d420c
restructure + claude [in progress, openai/gemini not done]
lorenss-m e285d66
rfb + runnable test [in progress}
lorenss-m beecc36
refactor openai + gemini
lorenss-m 8181d2e
fx
lorenss-m f33c7ee
imp and warmup
lorenss-m 3056a9f
mm fix
lorenss-m 1751b40
claude sdk
lorenss-m ae04127
fx win outputs
lorenss-m 9b0dec6
fx
lorenss-m e96ff9d
add inference-side instrumentation
jdchawla29 3921da2
fx
lorenss-m 145759a
add bu fix claude
lorenss-m ea185ce
additions
lorenss-m fda0479
fxs
lorenss-m 3a11712
add impl tinker api support + reward system
lorenss-m 429ec15
Merge branch 'v6' of https://github.com/hud-evals/hud-python into v6-…
lorenss-m 123fc16
temp: removing side-effects from importing hud.types
jdchawla29 d4b85b8
fix rollouts
lorenss-m 8929f9b
temp: fix 2
jdchawla29 c07895e
fix running
lorenss-m c21f27d
add eval flows
lorenss-m 6563750
telem
lorenss-m 7e2b7df
small change
jdchawla29 542b7d4
add legacy improvements, cleanup
lorenss-m 026fd9d
cleanup
lorenss-m 52623b1
cleanup
lorenss-m 3684598
fxs
lorenss-m b3fdb38
better legacy compat
lorenss-m 9b44b85
tests time
lorenss-m 4ba5a0f
fxs
lorenss-m 29a0fb1
fix tests
lorenss-m 4dcf91d
full tests and cleanup
lorenss-m cc7bb2d
Merge v6 into v6-agent-f-l (ours)
jdchawla29 2a356e3
Merge pull request #415 from hud-evals/v6-agent-f-l
jdchawla29 40d5db6
cleanup and add task cli
lorenss-m 4c7c5f1
rm push
lorenss-m 55b3ce8
improve readme and convert
lorenss-m bf60f0e
fxs
lorenss-m 2fb7aef
V6 contrainer mgmt (#416)
lorenss-m 9be51b2
refactor: decouple job registration from telemetry
jdchawla29 9bc8e78
docs
lorenss-m d67592f
Merge branch 'v6-contrainer-mgmt' of https://github.com/hud-evals/hud…
lorenss-m 54cad0c
changes in task and environment structure, replacing references to 'v…
jdchawla29 75b380e
refactor 1
jdchawla29 8613869
consolidation
jdchawla29 cfead4f
consolidate 2
jdchawla29 495139c
remove hud build
jdchawla29 96ea421
refactor
jdchawla29 2ed744c
refactor 2
jdchawla29 55ebe7e
cleanup
jdchawla29 467c7a4
restructure
jdchawla29 f3041a3
clean
jdchawla29 1b01b65
cookbooks
jdchawla29 3876bb0
utils
jdchawla29 82fcff6
delt
jdchawla29 8223526
restructure
jdchawla29 f74ab32
works on my machine
jdchawla29 0577a25
small clean
jdchawla29 98a67c6
small docs improvements and cli ux
lorenss-m d5f1f57
fxs
lorenss-m 95f61b5
rm skill
lorenss-m 2735555
update docs
lorenss-m cab7ee4
robot: add robot capability, environment.robots, and episode recorder
lukass16 40ca44a
final
jdchawla29 55afb33
docs
jdchawla29 820f76c
agent-side robot concerns to sdk
lukass16 23100a9
Merge origin/v6: docs tone/structure + CLI UX, keeping simplify API
jdchawla29 c515164
docs 2
jdchawla29 f141da1
pyright
jdchawla29 b09f8b7
Refactor task ID handling to strip environment prefixes for local tas…
jdchawla29 bf78a10
Merge pull request #418 from hud-evals/simplify
jdchawla29 f33d0c7
change env side robot telemetry
lukass16 3516852
Merge origin/v6 into v6-robot
lukass16 326dbf7
update robot telemetry
lukass16 56dfef6
update robot docs
lukass16 1231204
docs and matching
lukass16 4fafa69
fix matching
lukass16 5c41356
add ensembler
lukass16 0245d56
fix queue
lukass16 b2ff1d8
remove arbitrary tests, update adapter
lukass16 62a1554
small reliability fixes
lorenss-m 03f12b7
clean robot agent w/out tracing rewrite
lukass16 c722a9c
Align platform API client with the rewrite control plane
jdchawla29 8b15400
refactor datasaving
lukass16 ffdf742
undo delete
lukass16 e5f1edb
clean sim runner
lukass16 772d782
remove contracts
lukass16 5b7110a
Simplify the v6 contract surfaces
jdchawla29 a98b289
Merge remote-tracking branch 'origin/v6' into v6-robot
lukass16 1c7c058
Align v6 SDK and CLI surfaces with the rewrite control plane
jdchawla29 e553c9f
feat(eval): v6 placement model — Provider/HUDRuntime, run atom, agent…
jdchawla29 2f64fe7
fix(gateway): carry HUD key in x-goog-api-key for the gemini client
jdchawla29 ca1c834
test(eval): fix taskset export fixture to the canonical CP wire shape
jdchawla29 ec68b92
clean telemetry
lukass16 2e10145
Merge remote-tracking branch 'origin/v6' into v6-robot
lukass16 d5f7bc8
remove realtime for now
lukass16 67fd2b9
small fixes for platform
lukass16 e3520e2
remove data saving
lukass16 d62a651
keying and small ux updates, cleanup and dep mgmt
lorenss-m c308d1a
refactor and improve docs cadence
lorenss-m 68ea5b6
update endpoint
lukass16 e72a3eb
docs
lorenss-m 2a07225
docs adjustment
lorenss-m a472623
Merge branch 'v6-l-clean' of https://github.com/hud-evals/hud-python …
lorenss-m db58f86
align robot and docs, format and fixes
lorenss-m e34335c
fxs
lorenss-m 9f67834
Merge pull request #419 from hud-evals/v6-robot
lorenss-m 5962b07
thread runner add
lukass16 bc06c18
capability rename
lukass16 57aceb5
small tweak in proc + flush line
lukass16 b451efd
Merge pull request #420 from hud-evals/v6-robot-2
lorenss-m 1aa4e17
linter
lorenss-m 4925ec9
improve telem exporter
lorenss-m 68007e6
docs fixes
lorenss-m 39970b0
fix rubric based grader and windows local, add convenience imports
lorenss-m 48309ff
local teleme export + windows local test
lorenss-m d7f6cc5
env var merge and proper win support
lorenss-m 1f449da
upgrade settings links
lorenss-m a4a78c7
fix: env name resolution now uses env.py declared name, instead of se…
solvemproblr b72a944
Merge pull request #422 from hud-evals/asa/environment-name-fix
jdchawla29 88ba14d
improve local observability
lorenss-m 704bca4
add better remote guidance, docs and bump version
lorenss-m c673f40
small adjustments
lorenss-m 696d15f
feat(eval): add ModalRuntime provider for per-rollout Modal sandboxes
lukass16 bb53e47
feat(eval): add DaytonaRuntime provider for per-rollout Daytona sandb…
lukass16 166c2bf
fix(environment): set _hooks_done before adding constructor capabilities
lukass16 fb27f7f
chore(eval): silence S104 on intentional 0.0.0.0 bind in ModalRuntime
lukass16 dd1e391
fix(eval): derive DaytonaRuntime command from port to avoid tunnel mi…
lukass16 acc264e
fix(eval): type casting timeout to int for Modal and Daytona
lukass16 5977d5b
fix(eval): make Daytona sandboxes ephemeral by default
lukass16 4a31f50
fix(eval): fix exception handling in _ensure_snapshot
lukass16 2bf3f11
fix(eval): kill LocalRuntime process group to prevent orphan children
lukass16 30f7d13
chore(format): apply ruff formatting to claude sdk agent and cli init
lukass16 380dc40
fix(environment): set _hooks_done before constructor capabilities loop
lukass16 420718b
feat(eval): introduce RuntimeConfig for task-level resource management
jdchawla29 04c0344
fix(eval): address runtime config CI feedback
jdchawla29 d466034
adjustments
jdchawla29 d3775af
fix(eval): keep docker image shorthand
jdchawla29 ae79946
fix(eval): reject daytona run timeouts consistently
jdchawla29 566ecfe
Merge pull request #423 from hud-evals/lukass/modal-daytona-runtimes
jdchawla29 e4aa827
chore(eval): delete loose ORPHAN_BUG.md
lukass16 d87b34c
fix(eval): always SIGKILL LocalRuntime group, not only on timeout
lukass16 33e2037
Merge branch 'v6' into lukass/local-runtime-fixes
lukass16 a93caaf
chore(eval): delete loose test_local_runtime_orphan.py
lukass16 60e55bc
docs(eval): tighten _terminate process-group comment
lukass16 bbb614d
Merge pull request #428 from hud-evals/lukass/local-runtime-fixes
jdchawla29 b013ef8
feat(filetracking): workspace file-tracking capability + telemetry
lorenss-m 99663d2
style(filetracking): move test-only pytest import under TYPE_CHECKING…
lorenss-m c35783c
fix(filetracking): address bugbot review on the observer + tracker
lorenss-m f5c4f54
fix(filetracking): keep skipped diffs pending; root-only gitignore
lorenss-m 6cd857f
fix(filetracking): gate polling on successful observer setup
lorenss-m 0e48355
fix(filetracking): degrade gracefully when capability open fails
lorenss-m 81fc8dc
Merge pull request #429 from hud-evals/l/file-tracking
lorenss-m c13c3c5
Add cloud mode HUD runtime tunnel support
jdchawla29 67a48f4
feat(eval): make HUDRuntime use runtime tunnel
jdchawla29 0ad7424
fix(eval): address runtime tunnel review feedback
jdchawla29 1afc5ad
fix(cli): let runtime override remote flag
jdchawla29 47f1064
fix(cli): reject conflicting runtime placement flags
jdchawla29 8ad2be2
Merge pull request #430 from hud-evals/codex/cloud-runtime-tunnel-sdk
jdchawla29 6979905
fix: authlib deprecation warning
Parth220 584a6af
ad co8
lorenss-m d3d5d19
add modal runtime provider wiring
jdchawla29 6cc2be7
Merge pull request #433 from hud-evals/codex/modal-runtime-provider
jdchawla29 5bea22e
Add hud.TrainingClient + hud models CLI for managed RL training
lorenss-m 87306f3
add small notes
lorenss-m f611497
hud.train: no retry on stateful training POSTs; add 2048 RL cookbook
lorenss-m 98c8792
add timeout safety
lorenss-m 19bfad3
also remote taskset run via cli, report trace info
lorenss-m 4d80d7b
fx
lorenss-m b6dd7be
fx 2
lorenss-m 1b76c65
fix scoring and timeouts
lorenss-m e85b66e
small fix
lorenss-m 1905f42
Merge pull request #426 from hud-evals/l/tinker-training
jdchawla29 3db5250
Standardize default job names
lorenss-m 77cd964
Merge pull request #434 from hud-evals/feat/standardize-default-job-n…
lorenss-m
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| { | ||
| "tasksetId": "de5f3062-2587-4b33-a547-27995df213bd" | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,153 @@ | ||
| # HUD Python Agent Guide | ||
| | ||
| This repository is the Python SDK and CLI for HUD: environments, capabilities, | ||
| tasks, agents, the rollout engine, telemetry, and command-line workflows for | ||
| building and running agent evaluations. | ||
| | ||
| Priorities: solve the requested problem, keep scope tight, preserve public SDK | ||
| behavior where it is actually shipped, and improve code quality rather than | ||
| adding local workarounds. | ||
| | ||
| ## Where To Look First | ||
| | ||
| - `README.md` for the protocol, product concepts, and common CLI workflows. | ||
| - `docs/v6/` for the live SDK docs: quickstart, reference (environment, tasks, | ||
| capabilities, agents, graders, types, cli), run guides, and cookbooks. | ||
| - `CONTRIBUTING.md` for setup, test, lint, and type-check commands. | ||
| - `pyproject.toml` for supported Python versions, dependencies, optional extras, | ||
| ruff, pyright, pytest, and coverage configuration. | ||
| - Source files and colocated tests for exact behavior. Trust code and tests over | ||
| stale prose. | ||
| - `cookbooks/` for runnable end-to-end examples (each is its own uv project). | ||
| | ||
| Keep this file stable. Do not turn it into a release runbook, command matrix, or | ||
| inventory of current incidents. | ||
| | ||
| ## Repository Map | ||
| | ||
| - Core flow: `hud/environment/` (spec: capabilities, tasks, serving) → | ||
| `hud/eval/` (engine: rollout, runtimes, jobs) → `hud/agents/` (harnesses), | ||
| connected by `hud/capabilities/` and `hud/clients/`. | ||
| - `hud/cli/` is the Typer surface over the same modules. | ||
| - `hud/_legacy.py` and `hud/patches/` quarantine v5 compatibility. | ||
| - `cookbooks/` and `integrations/` live outside the `hud` package. | ||
| | ||
| ## Working Style | ||
| | ||
| - Run commands from the repository root unless a tool explicitly requires a | ||
| subdirectory. | ||
| - Use `uv` for Python commands. Do not rely on an activated virtualenv. | ||
| - Read files before editing them and follow nearby patterns. | ||
| - Keep edits focused on the requested behavior. Do not clean up unrelated code. | ||
| - Prefer editing existing docs over creating new docs unless the user asks for a | ||
| new document. | ||
| - Do not introduce hacks, monkey patches, or partial workarounds. If a robust | ||
| solution needs missing support, add that support cleanly or report the blocker. | ||
| - Report any part of a change that is uncertain, fragile, or intentionally left | ||
| unverified. | ||
| | ||
| ## Setup And Checks | ||
| | ||
| Use the commands in `CONTRIBUTING.md` as the source of truth. Common commands: | ||
| | ||
| ```bash | ||
| uv sync --extra dev | ||
| uv run pytest -q | ||
| uv run ruff format . --check | ||
| uv run ruff check . | ||
| uv run pyright | ||
| ``` | ||
| | ||
| The shared pre-push hook lives in `.githooks/pre-push`, but agents should not | ||
| change local git config unless explicitly asked. | ||
| | ||
| Tests run on Python 3.11 and 3.12 in CI. `pyproject.toml` currently supports | ||
| Python `>=3.11, <3.13`. | ||
| | ||
| ## Code Quality Bar | ||
| | ||
| - Prefer direct, typed, maintainable code over clever or magical abstractions. | ||
| - Be ambitious about simplification. Look for ways to delete whole branches, | ||
| helper layers, modes, and special cases while preserving behavior. | ||
| - Fail fast and loudly. Avoid silent fallbacks, broad exception swallowing, and | ||
| defensive branches that hide broken invariants. | ||
| - Minimize branching. Every new `if`, `try`, compatibility path, or nullable mode | ||
| should earn its keep. | ||
| - Preserve documented public API and persisted behavior unless the task is an | ||
| intentional migration. Do not add compatibility layers for unshipped branch | ||
| work; replace the design cleanly. | ||
| - Reuse canonical helpers and local abstractions before adding new ones. | ||
| - Keep feature logic in the layer that owns the concept. Treat scattered | ||
| feature checks in shared paths as a design problem. | ||
| - Prefer explicit contracts over optional, loosely shaped, or cast-heavy data. | ||
| - Delete dead code. Do not keep obsolete paths around "just in case." | ||
| - Keep comments rare and useful. Explain non-obvious intent, not what the next | ||
| line mechanically does. | ||
| - Remove AI-generated slop before finishing: unnecessary comments, abnormal | ||
| defensive checks, broad `try` blocks, type bypasses, deep nesting, and thin | ||
| wrappers that do not reduce real complexity. | ||
| - Be suspicious of files pushed past 1000 lines. Decompose when there is a clear | ||
| focused module to extract. | ||
| - Avoid new core dependencies. If a dependency is only needed for optional | ||
| provider, tool, or integration behavior, put it behind the relevant extra. | ||
| | ||
| ## Typing And Imports | ||
| | ||
| - Type public APIs and cross-module contracts. Prefer explicit Pydantic models or | ||
| typed structures over ad-hoc dictionaries at boundaries. | ||
| - `cast(...)` and `assert ...` are acceptable for real type narrowing. Broad | ||
| `# type: ignore` comments are not. | ||
| - Keep `Any` contained to genuinely dynamic payloads such as provider JSON, | ||
| metadata, or third-party integration blobs. | ||
| - Keep imports at the top of the module. Use inline imports only for an existing | ||
| lazy optional-dependency pattern or a documented circular-import constraint. | ||
| - Use `TYPE_CHECKING` imports for type-only imports that would otherwise add | ||
| runtime dependency cost or cycles. | ||
| | ||
| ## Testing Expectations | ||
| | ||
| - Add or update focused tests for behavior changes. Put tests near the module | ||
| they cover, following the existing `*/tests/` layout. | ||
| - Test behavior and contracts, not private implementation details. | ||
| - Regression tests should fail on the old behavior through the normal lifecycle | ||
| or public boundary. Do not manually seed private state such as internal maps, | ||
| caches, cursors, or prepared containers just to prove a changed line. | ||
| - If a bug involves internal state, reach it through real setup and execution: | ||
| construction, configuration, preparation, run loop, provider response, tool | ||
| execution, or public API call. | ||
| - Do not add hooks, helper methods, or abstraction layers only to make tests | ||
| easier. If a test needs that, reconsider the behavior boundary instead. | ||
| - Test names should describe the observable behavior or contract, not the | ||
| private mechanism. | ||
| - Mock external services, provider APIs, network, Docker, browser, and filesystem | ||
| boundaries as needed. Do not mock core logic just to make a test easy. | ||
| - Mark tests that require `HUD_API_KEY`, network access, or deployed services as | ||
| integration tests. | ||
| - Run the narrowest relevant tests first, then broader checks when the blast | ||
| radius is shared or user-facing. | ||
| | ||
| ## Operational Debugging | ||
| | ||
| - Follow the execution path instead of guessing from abstractions. | ||
| - For CLI issues, start with the command module, then config/settings, then the | ||
| SDK module being exercised. | ||
| - For agent/provider issues, inspect gateway resolution, provider adapter code, | ||
| capability-backed tool wiring, and recorded request/response shapes. | ||
| - For environment/task issues, inspect the task lifecycle (start/grade), the | ||
| control-channel server and client, and capability routing/tunneling. | ||
| - For execution issues, inspect the rollout engine: runtime provider | ||
| acquisition, `connect`, the `Run` lifecycle, and job/trace reporting. | ||
| - For telemetry issues, inspect instrumentation boundaries and exporter behavior | ||
| before changing call sites. | ||
| - Report what was verified, what remains inferred, and which file, test, trace, | ||
| or command output supports the conclusion. | ||
| | ||
| ## Decision Protocol | ||
| | ||
| Ask first when scope, public API compatibility, or ownership is unclear. | ||
| | ||
| Choose and flag when naming, test boundaries, or local structure are ambiguous | ||
| but the direction is straightforward. | ||
| | ||
| Just do it when fixing formatting, applying an obvious bug fix with clear root | ||
| cause, tightening types, or removing slop that does not change behavior. |
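The "Typing And Imports" and "Fail fast and loudly" guidance in the AGENTS.md diff above could look roughly like the sketch below. It is only an illustration: `GradeResult` and `parse_grade` are hypothetical names, not HUD SDK types, and a stdlib dataclass stands in for the Pydantic model the guide prefers so that the snippet stays self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Type-only import: used in annotations, never needed at runtime.
    from collections.abc import Mapping


@dataclass(frozen=True)
class GradeResult:
    """Hypothetical typed contract for a grader payload (not a HUD type)."""

    task_id: str
    reward: float


def parse_grade(payload: Mapping[str, Any]) -> GradeResult:
    """Turn a dynamic payload into a typed result, failing loudly on bad input."""
    # Missing keys raise KeyError instead of falling back to a silent default.
    task_id = payload["task_id"]
    reward = payload["reward"]
    if not isinstance(task_id, str):
        raise TypeError(f"task_id must be str, got {type(task_id).__name__}")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(reward, bool) or not isinstance(reward, (int, float)):
        raise TypeError(f"reward must be a number, got {type(reward).__name__}")
    return GradeResult(task_id=task_id, reward=float(reward))
```

`Any` is confined to the incoming payload; everything past the boundary works with the typed `GradeResult`, and a malformed payload stops execution at the point where the invariant breaks.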
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| AGENTS.md |