test(integration): refresh stale notebook + workbench tests against current layout #264
Merged
…urrent layout

Three groups of pre-existing integration failures, none related to live LLM behavior; all are tests that drifted past code that moved.

(A) ``tests/integration/test_notebooks_subset.py``: ``TestNotebookExecution`` had 5 methods pointing at notebook filenames that no longer exist on disk (the notebook catalogue was renumbered to a contiguous 1-70 sequence). Updated each subprocess invocation to target the current file. Renamed the test methods to match the new numbers (``test_notebook_36_runs`` → ``test_notebook_35_runs`` etc.) so the suite reads consistently end-to-end. Added a header comment noting that these must stay in sync with ``examples/``.

(B) ``tests/integration/test_workbench_categories.py``: ``test_endpoint_returns_curated_categories`` asserted ``"router"`` was a top-level category; the workbench combined router + observability into a single ``"router-observability"`` track in ``workbench/backend/runner.py::NOTEBOOK_CATEGORIES``. Updated the required-id list. Renamed the SSE-suite assertion to ``test_router_observability_groups_router_plus_eventbus`` and pointed it at notebooks 58-61 (the actual router + EventBus + observability notebooks today) instead of 52-55 (which are now production / checkpointer tests).

(E) ``examples/notebook_70_oci_tools.py``: the ``_env`` helper hard-required ``OCI_USE_PROFILE`` / ``OCI_USE_REGION`` / ``OCI_USE_TENANCY`` and ``OCI_GENAI_PROFILE``, exiting 2 if any was missing. That's hostile to users who already exported the standard OCI envelope (``OCI_PROFILE`` / ``OCI_REGION`` / ``OCI_COMPARTMENT``), and it's exactly what tripped ``test_notebooks_all_live.py`` for this notebook in CI. Added a ``fallbacks=`` parameter to ``_env`` and wired every ``OCI_USE_*`` / ``OCI_GENAI_*`` read to fall back through the standard names. Documented in the helper's docstring.

Local re-runs:
- 5 ``test_notebooks_subset.py::TestNotebookExecution`` tests pass (33s, all run real ``python examples/notebook_NN_*.py`` subprocesses).
- 4 ``test_workbench_categories.py::TestNotebookCategories`` tests pass against the live runner ``TestClient``.
- ``test_notebooks_all_live.py[notebook_70_oci_tools]`` passes with ``OCI_PROFILE`` / ``OCI_REGION`` / ``OCI_COMPARTMENT`` set (no ``OCI_USE_*`` overrides required).

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
…x_tokens for reasoning models
Three more pre-existing integration failures that surfaced once the
stale-test cleanup landed. All caused by the test envelope being too
tight for reasoning-model traffic (gpt-5.5, o-series), not by real bugs.
(1) ``src/locus/models/providers/oci/client.py`` — the OCI Python SDK
default read timeout is 60s, which isn't enough for reasoning-model
summarization calls in the orchestrator + swarm flows (first response
token can take 90-180s to arrive after the model finishes hidden
chain-of-thought). Added ``connect_timeout`` (default 10s) and
``read_timeout`` (default 300s) to ``OCIClientConfig`` and wired both
through to ``GenerativeAiInferenceClient`` for all four auth modes
(api_key, security_token, instance_principal, resource_principal).
The 60s default surfaces in failures as
``urllib3.ReadTimeoutError: ... read timeout=60.0``.
(2) ``tests/integration/test_notebooks_all_live.py`` — the
``_NOTEBOOK_TIMEOUT_OVERRIDES`` map keyed off
``notebook_40_emergent_routing.py``; the notebook had been renumbered
to ``notebook_34_emergent_routing.py`` so the override no longer
matched, leaving the test on the default 360s budget while the
underlying notebook actually needs ~7-9 min. Renamed the key.
(3) ``tests/integration/conftest.py`` — the OCI / OpenAI test
fixtures built models with ``max_tokens=512``. Reasoning models burn
200-2000+ output tokens on hidden chain-of-thought before producing
any visible text; at 512 they return empty content with
``finish_reason='length'``, which surfaces in orchestrator + swarm
tests as ``summary=''`` and ``findings={}`` even though
``success=True``. Bumped to 8192 with a comment explaining the
ceiling-vs-target tradeoff (short-answer tests still finish fast
because the model stops naturally when done).
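The stale-key problem in (2) can be sketched as follows. This is an illustration only, not the test module's code: the 600s value, ``DEFAULT_TIMEOUT_S``, ``timeout_for`` and ``stale_override_keys`` are hypothetical names and numbers; the source only states a 360s default and a ~7-9 min requirement.

```python
from __future__ import annotations

from pathlib import Path

DEFAULT_TIMEOUT_S = 360  # default budget mentioned above

# Keyed by notebook filename; the 600s figure is an assumed value.
_NOTEBOOK_TIMEOUT_OVERRIDES: dict[str, int] = {
    "notebook_34_emergent_routing.py": 600,
}


def timeout_for(notebook: Path) -> int:
    """Return the override for this notebook, else the default budget."""
    return _NOTEBOOK_TIMEOUT_OVERRIDES.get(notebook.name, DEFAULT_TIMEOUT_S)


def stale_override_keys(examples_dir: Path) -> list[str]:
    """List override keys that no longer match a notebook file on disk."""
    present = {p.name for p in examples_dir.glob("notebook_*.py")}
    return sorted(k for k in _NOTEBOOK_TIMEOUT_OVERRIDES if k not in present)
```

A lookup keyed on an old filename silently falls through to the default, which is why a guard along the lines of ``stale_override_keys`` might catch the next renumbering earlier.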
Local re-runs (BOAT-OC1, ``openai.gpt-5.5``, us-chicago-1):
- ``test_summary_instead_of_bare_stop`` — passes (was OCI timeout)
- ``test_notebook_runs_clean[notebook_34_emergent_routing]`` — passes
(was 360s subprocess timeout)
- ``test_swarm_executes_tasks`` — passes (was empty findings)
- ``test_orchestrator_single_specialist`` — passes (was empty summary)
- ``test_orchestrator_multiple_specialists`` — passes (was empty summary)
5/5 of the previously-environmental failures now pass deterministically.
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
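A rough sketch of how the two timeouts from (1) could be carried as a single SDK keyword. ``ClientTimeouts`` and ``as_sdk_kwarg`` are hypothetical stand-ins, not the real ``OCIClientConfig``; the only assumption about the SDK is that its clients accept ``timeout`` as a ``(connect, read)`` tuple.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ClientTimeouts:
    """Hypothetical stand-in for the timeout fields on OCIClientConfig."""

    connect_timeout: float = 10.0  # seconds to establish the connection
    read_timeout: float = 300.0    # seconds to wait for response bytes

    def as_sdk_kwarg(self) -> dict[str, tuple[float, float]]:
        """Build the ``timeout=(connect, read)`` keyword for the client."""
        return {"timeout": (self.connect_timeout, self.read_timeout)}


defaults = ClientTimeouts().as_sdk_kwarg()
patient = ClientTimeouts(read_timeout=600.0).as_sdk_kwarg()

# Intended use, shown as a comment so the sketch runs without the SDK:
#   GenerativeAiInferenceClient(config, **ClientTimeouts().as_sdk_kwarg())
```

Keeping the connect timeout short while raising only the read timeout means an unreachable endpoint still fails fast; only a slow first token gets the longer budget.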
``test_instance_principal_client_creation`` was pinned to the exact keyword args passed to ``GenerativeAiInferenceClient``. The previous commit added ``timeout=(connect, read)`` to all four client-creation paths, so the strict ``assert_called_once_with(...)`` started missing the new kwarg. Updated the assertion to include the default tuple ``(10.0, 300.0)`` from ``OCIClientConfig``.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
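To illustrate why a strict ``assert_called_once_with(...)`` breaks when a call site gains a keyword, here is a small self-contained sketch using ``unittest.mock``. The ``create_client`` function and the ``config`` / ``signer`` call shape are assumptions for illustration, not the project's actual client factory.

```python
from unittest.mock import MagicMock

# Stand-in for the patched SDK class in the real test.
client_cls = MagicMock(name="GenerativeAiInferenceClient")


def create_client(config: dict, signer: object) -> object:
    """Hypothetical factory that now forwards the timeout tuple."""
    return client_cls(config=config, signer=signer, timeout=(10.0, 300.0))


signer = object()
create_client({}, signer)

# The old, pinned expectation omits ``timeout`` and therefore fails.
try:
    client_cls.assert_called_once_with(config={}, signer=signer)
    stale_assertion_passed = True
except AssertionError:
    stale_assertion_passed = False

# The refreshed expectation lists every keyword, including the new one.
client_cls.assert_called_once_with(config={}, signer=signer, timeout=(10.0, 300.0))
```

``assert_called_once_with`` compares the full argument list, so any added keyword must be reflected in the expectation.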
fede-kamel added a commit that referenced this pull request on May 23, 2026
…gonomics + httpx 1.0 cap + trademark naming (#265)

Four PRs of fixes since b20. No new public APIs; tightens the SDK on durability (StateGraph interrupt resume), ergonomics (OCIModel aliases, AgentConfig.name, Tool.func), deps (httpx<1.0 cap), and brings the docs site in line with the approved product name.

- #261: StateGraph.interrupt_before now writes through the checkpointer at the pause boundary; resume advances past the gate instead of re-pausing. Inline interrupt() save crash with state=None fixed in the same pass. OCIModel gains region= and profile= ergonomic aliases. AgentConfig.name + Tool.func surface the names users naturally reach for.
- #262: Capped httpx<1.0; pre-release 1.0.dev3 drops the top-level Auth re-export and broke OCIRequestSigner + BearerAuth at import.
- #257: Applied the Oracle Trademark Legal-approved full name (wordmark above hero H1, persistent header) and short name (body prose / OG meta / tab title) across docs, README, and contributor markdown.
- #264: OCI client read timeout default 60s→300s for reasoning models; integration fixture max_tokens 512→8192 so reasoning models have budget for both hidden chain-of-thought and visible output; eight stale integration tests refreshed against current catalogue / workbench layout.

See CHANGELOG.md for the full breakdown.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
Eight pre-existing integration failures across three areas, all caused by tests drifting past code that moved. Not LLM-behavioral; purely staleness.

(A) ``test_notebooks_subset.py``: 5 fails

``TestNotebookExecution`` had 5 methods pointing at notebook filenames that no longer exist (the catalogue was renumbered to a contiguous 1-70 sequence). Updated each subprocess call to target the current file and renamed the test methods so the suite reads consistently.

| Before | After |
|---|---|
| ``test_notebook_14_runs`` → ``notebook_19_sse_streaming.py`` | ``test_notebook_13_runs`` → ``notebook_13_sse_streaming.py`` |
| ``test_notebook_36_runs`` → ``notebook_41_structured_output.py`` | ``test_notebook_35_runs`` → ``notebook_35_structured_output.py`` |
| ``test_notebook_37_runs`` → ``notebook_42_reasoning_patterns.py`` | ``test_notebook_36_runs`` → ``notebook_36_reasoning_patterns.py`` |
| ``test_notebook_43_runs`` → ``notebook_48_playbooks.py`` | ``test_notebook_46_runs`` → ``notebook_46_playbooks.py`` |
| ``test_notebook_49_runs`` → ``notebook_54_checkpoint_backends.py`` | ``test_notebook_52_runs`` → ``notebook_52_checkpoint_backends.py`` |

(B) ``test_workbench_categories.py``: 2 fails

``workbench/backend/runner.py::NOTEBOOK_CATEGORIES`` combined router + observability into a single ``router-observability`` track. The tests still asserted the old taxonomy.

- ``test_endpoint_returns_curated_categories``: replaced ``"router"`` and ``"observability"`` in the required-id list with ``"router-observability"``.
- Renamed ``test_observability_category_contains_new_sse_notebooks`` → ``test_router_observability_groups_router_plus_eventbus`` and pointed it at notebooks 58-61 (the current router + EventBus + observability notebooks) instead of 52-55 (which are now production / checkpointer tests).

(E) ``examples/notebook_70_oci_tools.py``: 1 fail

The ``_env`` helper hard-required ``OCI_USE_PROFILE`` / ``OCI_USE_REGION`` / ``OCI_USE_TENANCY`` and ``OCI_GENAI_PROFILE``, exiting 2 if any was missing. Hostile to users who already exported the standard OCI envelope (``OCI_PROFILE``, ``OCI_REGION``, ``OCI_COMPARTMENT``), and the exact reason ``test_notebooks_all_live.py[notebook_70_oci_tools]`` failed in the live suite. Added a ``fallbacks=`` parameter to ``_env`` and wired every ``OCI_USE_*`` / ``OCI_GENAI_*`` read to fall back through the standard names.

Test plan

- ``test_notebooks_subset.py::TestNotebookExecution``: 5/5 pass (33s, real ``python examples/notebook_NN_*.py`` subprocesses)
- ``test_workbench_categories.py::TestNotebookCategories``: 4/4 pass against the live runner ``TestClient``
- ``test_notebooks_all_live.py[notebook_70_oci_tools]``: passes with ``OCI_PROFILE`` / ``OCI_REGION`` / ``OCI_COMPARTMENT`` set
- ``pre-commit run --files <staged>``: pass

Out of scope

The other 6 failures from the full integration run were OCI read-timeouts under ``-n 4`` parallel load (orchestrator + swarm tests hitting ``read timeout=60.0``) and an xAI rate-limit hit. These are environmental, not code bugs. Those need a separate look at OCI throttling / pytest concurrency, not a test rewrite.