Skip to content

fix(eval): implement function-call loop so agent can use tools#4

Merged
ericchansen merged 1 commit into
mainfrom
fix/eval-function-call-loop
May 28, 2026
Merged

fix(eval): implement function-call loop so agent can use tools#4
ericchansen merged 1 commit into
mainfrom
fix/eval-function-call-loop

Conversation

@ericchansen
Copy link
Copy Markdown
Owner

Problem

The CD pipeline fails at the Evaluate Agent (DEV) step with:

azure.ai.evaluation._exceptions.EvaluationException: (UserError) Response string cannot be empty.
  File ".../run_evaluation.py", line 216, in _run_real_evaluation
    g = groundedness_eval(response=resp["answer"], context=resp["expected"])

Root cause

The deployed agent has a custom function tool (calculator). When the Foundry Responses API hits a function tool, it returns function_call output items and waits for the client to execute the tool locally and submit results back via function_call_output items. Until that happens, response.output_text is empty.

The previous code did a single responses.create() and read output_text directly — so every math question in eval_dataset.jsonl came back with an empty answer, and the first call into GroundednessEvaluator crashed the whole pipeline.

Fix

Implement the function-call loop in _query_agent_with_tool_loop():

  1. Send the question via responses.create() with agent_reference
  2. While the response contains function_call items:
    • Execute each call locally via _execute_function_tool() (looks up the name in a registry → calls agent.tools.<name>.execute_<name>)
    • Submit results as function_call_output items, chained with previous_response_id
    • Read the new response
  3. Stop when there are no more function calls (or after 5 iterations as a safety cap)

Also added a diagnostic log when an answer is still empty after the loop completes (prints the tool-call trace), so the next failure of this kind is actually debuggable.

Files

  • src/scripts/run_evaluation.py (+103 / -13)

Verification

  • python -m py_compile src/scripts/run_evaluation.py passes
  • Real verification: CD will run on main after merge and exercise the tool loop against the deployed calculator tool

Out of scope

  • src/scripts/test_agent.py has the same gap but is a smoke test, not a CI gate — leaving it for a follow-up if you want
  • No fallback "skip empty responses with zero score" logic added — the tool loop should eliminate the empty case for the current dataset; if it doesn't, the diagnostic log will tell us exactly what's missing

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

Copilot AI review requested due to automatic review settings May 28, 2026 23:07
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the real evaluation path so deployed Foundry agents that emit Responses API function calls can execute local tools before evaluator scoring.

Changes:

  • Adds a local function-tool registry and executor for the calculator tool.
  • Replaces single-shot agent queries with a function-call loop using function_call_output.
  • Adds diagnostic logging and stores tool-call traces when answers remain empty.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/scripts/run_evaluation.py
Comment thread src/scripts/run_evaluation.py Outdated
The deployed agent has a custom `function` tool (calculator). When the
Responses API hits a function tool, it returns `function_call` output items
and waits for the CLIENT to execute the tool and submit results via
`function_call_output` items. The previous eval script never did this, so
`response.output_text` was empty for any tool-using question, which crashed
`GroundednessEvaluator` with `Response string cannot be empty`.

Adds:
- `_build_function_tool_registry()` — name -> local executor map
- `_execute_function_tool()` — JSON args in, JSON output out
- `_query_agent_with_tool_loop()` — drives the Responses tool loop using
  `previous_response_id` chaining, capped at 5 iterations
- Diagnostic log when an answer is still empty after the loop

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ericchansen ericchansen force-pushed the fix/eval-function-call-loop branch from 3f7908c to f51d8b6 Compare May 28, 2026 23:13
@ericchansen ericchansen merged commit 5431883 into main May 28, 2026
1 check passed
@ericchansen ericchansen deleted the fix/eval-function-call-loop branch May 28, 2026 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants