fix: terminate infinite retry loop in LoadSkillResourceTool on RESOURCE_NOT_FOUND#5651
fix: terminate infinite retry loop in LoadSkillResourceTool on RESOURCE_NOT_FOUND#5651Raman369AI wants to merge 4 commits into
Conversation
…CE_NOT_FOUND When load_skill_resource returned RESOURCE_NOT_FOUND, the LLM treated it as a transient soft error and retried the same path indefinitely, producing runaway invocations and large API bills. Two complementary guards are added: 1. Code (LoadSkillResourceTool.run_async): an invocation-scoped retry guard tracks already-attempted (skill, path) pairs in tool_context.state under a temp:_adk_skill_resource_failed_paths_<id> key. The temp: prefix prevents persistence to durable session storage; the invocation_id suffix prevents leakage across invocations on in-memory session backends (where temp keys are not auto-cleared). A second call on the same path within the same invocation returns RESOURCE_NOT_FOUND_FATAL with an explicit stop instruction, giving the LLM an unambiguous terminal signal. 2. Prompt (DEFAULT_SKILL_SYSTEM_INSTRUCTION): adds a rule prohibiting retrying the same path after any error, and a scope boundary clarifying that load_skill_resource is for skill-bundled files only, not for runtime user input (a common source of hallucinated paths). Neither guard alone is sufficient: a code-only stop produces confusing downstream behavior when the LLM has no semantic understanding of why to stop; a prompt-only guard can be ignored under imperfect system prompts. Both layers are required for defense-in-depth. Tests cover: repeat-failure escalation, distinct-path soft errors not escalating, cross-invocation isolation with shared session state, and the temp: prefix on tracking keys.
E2E Reproduction Evidence — RunnerThis comment adds live end-to-end evidence as required by CONTRIBUTING.md § Manual End-to-End Tests (Runner). Minimal Reproduction Agent# test_agent/agent.py
from google.adk.agents import Agent
from google.adk.skills import models
from google.adk.tools.skill_toolset import SkillToolset
greeting_skill = models.Skill(
frontmatter=models.Frontmatter(
name="document-classifier",
description=(
"Classifies the document supplied by the user based on the reference document"
),
),
instructions=(
"Use Document 1 as reference document and Document 2 as input. Classify each document as"
" python-code, non-python-code, or mixed. Return a short comparison."
),
)
root_agent = Agent(
model="gemini-3-flash-preview",
name="root_agent",
description="A repro agent for SkillToolset inline document handling.",
instruction=("Always use the document as the reference and summarize it."),
tools=[SkillToolset(skills=[greeting_skill])],
)Run command: adk webTrigger prompt (any vague document reference is sufficient): What happens (unpatched)The skill instructions reference "Document 1" and "Document 2" as if they are bundled skill resources, but no files are attached to the skill ( From the LLM's perspective, ADK Web trace —
(⚡ = RESOURCE_NOT_FOUND error returned; ✓ = tool call dispatched successfully but returned the error string — the alternating pattern reflects the model immediately re-invoking after each soft failure.) Server log — loop persisting through and beyond CTRL+C: The loop continued firing new LLM requests for ~20 seconds after the first CTRL+C, requiring repeated interrupt signals to force-quit. In-flight API calls are not cancellable — every iteration that completes before the process dies is billed. Why ambiguous prompts make this a framework issue, not a user errorThis reproduction agent is not contrived — skill instructions that reference documents by natural-language names ("Document 1", "reference document") are a normal and expected authoring pattern. The ambiguity is unavoidable: skill authors do not know at write-time whether end-users will supply input inline or whether the model will infer a resource load. The framework must be defensive here because:
The defense-in-depth approach in this PR — |
|
Hi @Raman369AI , Thank you for your contribution! We appreciate you taking the time to submit this pull request. Your PR has been received by the team and is currently under review. We will provide feedback as soon as we have an update to share. |
|
Hi @xuanyang15 , can you please review this. |
- Replace per-path failed-paths list with a total RESOURCE_NOT_FOUND
counter (temp:_adk_skill_resource_not_found_count_<invocation_id>).
Live tracing showed the LLM hallucinates a different path on every
retry, so a per-path guard never fired. The counter catches any second
failure regardless of path.
- Fix forbidden logger pattern in oauth_mcp_server.py
(getLogger(__name__) → getLogger('google_adk.' + __name__)).
- Update tests to match counter-based logic: rename/rewrite guard tests,
add test_load_resource_different_path_also_escalates_to_fatal to cover
the hallucinated-variant pattern, fix mock state issue in parametrized
test and test_scripts_resource_not_found.
f18f9f9 to
f89e4b6
Compare
Fixes #5652
Summary
Closes a latent defect in
SkillToolsetthat lets the LLM enter an unbounded retry loop whenload_skill_resourcereturnsRESOURCE_NOT_FOUND. Because the only existing backstop isRunConfig.max_llm_calls(default 500), a hallucinated path can quietly burn the entire per-invocation call budget before the framework intervenes — andmax_llm_callsis a global cap on legitimate reasoning, not a defense against this specific failure mode.This fix adds invocation-scoped termination to the tool itself, plus a complementary system-prompt rule, so the framework no longer relies on a perfect upstream prompt to avoid unbounded loops on a known error path.
Why this matters
This is a structural problem, not an edge case:
load_skillresponse intentionally omits a manifest of available files — that is the agentskills.io progressive-disclosure contract, and it is correct for token economy. But it means the LLM must infer paths from prose instructions insideSKILL.md. Inferred paths are routinely wrong, even with strong models.RESOURCE_NOT_FOUNDis returned as a structured soft string. From the model's perspective it looks transient and recoverable, exactly like a flaky network error — so it retries. There is no terminal signal in the current implementation.SkillToolsetdistinguishes "first miss" from "500th miss". The loop terminates only whenmax_llm_calls(default 500) is hit.load_skill_resource) and runtime user inputs (e.g., a PDF the user is processing). A model asked to analyze a runtime document will sometimes route it throughload_skill_resource, hitRESOURCE_NOT_FOUND, and loop — even though the file was never a skill resource.Live trace evidence
Captured via
GET /debug/trace/session/{session_id}againstadk webwith a patched build:The model tried
references/reference_doc.mdfirst, then hallucinated a completely different path (references/Document1.md) on the retry. The counter-based guard caught the second failure and returnedRESOURCE_NOT_FOUND_FATAL, terminating the loop. A per-path guard would have missed this entirely.What changed
Two layers in one file (
src/google/adk/tools/skill_toolset.py), plus tests:1. Code: invocation-scoped failure counter in
LoadSkillResourceTool.run_asyncA total
RESOURCE_NOT_FOUNDcount (across all paths) is tracked ontool_context.stateunder the key:temp:prefix uses ADK's existing convention so the value is trimmed from the persisted event delta and never reaches durable session storage.<invocation_id>suffix ensures correctness on in-memory session backends, wheretemp:keys are added tosession.stateand are not auto-cleared between invocations. Without the suffix, a failure in invocation A would spuriously escalate the first attempt in invocation B.Behavior:
RESOURCE_NOT_FOUNDwithin an invocation (any path) → returnsRESOURCE_NOT_FOUND(unchanged).RESOURCE_NOT_FOUNDwithin the same invocation → returns the newRESOURCE_NOT_FOUND_FATALerror code with an explicit "do not retry — report the error and stop" message. The counter includes the failure number so the LLM has unambiguous context.The guard is invocation-scoped, path-agnostic (catches hallucinated variants), and adds no overhead on the success path.
2. Prompt: two additions to
_DEFAULT_SKILL_SYSTEM_INSTRUCTIONload_skill_resourcereturns any error, do not retry the same path. Report the error to the user and stop."load_skill_resourceis only for skill-bundled files inreferences/,assets/, orscripts/— not for runtime user input. This addresses the secondary failure mode where the model confuses a runtime document with a skill resource.Why both layers
Defense-in-depth. Code-only termination produces confusing downstream behavior when the LLM has no semantic reason to stop. Prompt-only termination relies on the LLM following the rule, which imperfect upstream prompts can override. Together they are robust.
Considered and rejected
max_llm_callsafter_tool_callbackworkaroundSkillToolset; the framework still ships with the loopavailable_resourcesmanifest to the L2load_skillresponselist_skill_resourcestoolTest plan
New tests in
tests/unittests/tools/test_skill_toolset.py:test_load_resource_repeated_failure_escalates_to_fatal— second call on the same missing path within an invocation returnsRESOURCE_NOT_FOUND_FATAL.test_load_resource_different_path_also_escalates_to_fatal— second call on a different missing path also returnsRESOURCE_NOT_FOUND_FATAL; proves the counter is path-agnostic and catches the hallucinated-variant pattern observed in live testing.test_load_resource_failures_isolated_per_invocation— failures from invocation A do not escalate the first attempt in invocation B even when sharing the same session state dict.test_load_resource_counter_uses_temp_prefix— every key written by the guard starts withtemp:so it is trimmed from the persisted event delta and never reaches durable storage.test_load_resource_first_missing_returns_soft_error— firstRESOURCE_NOT_FOUNDin an invocation returns the unchanged soft error code and message.Verification:
pyink --checkclean on both modified files.isort --check-onlyclean on both modified files.mypy src/google/adk/tools/skill_toolset.pyreports the same 17 pre-existing errors asmain; zero new errors introduced.Backwards compatibility
RESOURCE_NOT_FOUNDerror code, same error string. Existing callers see no difference.RESOURCE_NOT_FOUND_FATALcode is purely additive.temp:prefix and is guaranteed not to leak into persisted session storage.