fix(eval): implement function-call loop so agent can use tools by ericchansen · Pull Request #4 · ericchansen/foundry-agents-lifecycle

ericchansen · 2026-05-28T23:07:28Z

Problem

The CD pipeline fails at the Evaluate Agent (DEV) step with:

azure.ai.evaluation._exceptions.EvaluationException: (UserError) Response string cannot be empty.
  File ".../run_evaluation.py", line 216, in _run_real_evaluation
    g = groundedness_eval(response=resp["answer"], context=resp["expected"])

Root cause

The deployed agent has a custom function tool (calculator). When the Foundry Responses API hits a function tool, it returns function_call output items and waits for the client to execute the tool locally and submit results back via function_call_output items. Until that happens, response.output_text is empty.

The previous code did a single responses.create() and read output_text directly — so every math question in eval_dataset.jsonl came back with an empty answer, and the first call into GroundednessEvaluator crashed the whole pipeline.
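To make the stalled state concrete, here is a minimal sketch of what a single-shot call sees, using `SimpleNamespace` stand-ins rather than real SDK objects. The helper names (`pending_function_calls`, `to_function_call_output`) and the calculator's argument shape are assumptions for illustration, not code from this PR.

```python
from types import SimpleNamespace


def pending_function_calls(response):
    """Return the function_call items the client still has to execute."""
    return [
        item for item in response.output
        if getattr(item, "type", None) == "function_call"
    ]


def to_function_call_output(call, result_json):
    """Build the input item that answers one function_call, matched by call_id."""
    return {
        "type": "function_call_output",
        "call_id": call.call_id,
        "output": result_json,
    }


# Stand-in for what one responses.create() returns when the agent wants a tool.
# The argument shape below is hypothetical; the real calculator schema may differ.
stalled = SimpleNamespace(
    id="resp_1",
    output_text="",  # empty until the tool result is submitted back
    output=[
        SimpleNamespace(
            type="function_call",
            name="calculator",
            call_id="call_1",
            arguments='{"operation": "add", "a": 2, "b": 2}',
        )
    ],
)

calls = pending_function_calls(stalled)
reply = to_function_call_output(calls[0], '{"result": 4}')
```

Reading `stalled.output_text` at this point yields an empty string, which is the value that reached `GroundednessEvaluator`.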

Fix

Implement the function-call loop in _query_agent_with_tool_loop():

1. Send the question via responses.create() with agent_reference
2. While the response contains function_call items:
   - Execute each call locally via _execute_function_tool() (looks up the name in a registry → calls agent.tools.<name>.execute_<name>)
   - Submit results as function_call_output items, chained with previous_response_id
   - Read the new response
3. Stop when there are no more function calls (or after 5 iterations as a safety cap)
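The steps above can be sketched roughly as follows. This is not the code in `run_evaluation.py`: the function signature, the `registry` shape, and the trace format are assumptions. How `agent_reference` is passed is not shown in this PR, so it is left to `**create_kwargs`. The demo runs against a fake client, so nothing here touches the network.

```python
import json
from types import SimpleNamespace


def query_agent_with_tool_loop(client, question, registry, max_iterations=5, **create_kwargs):
    """Drive the function-call loop; return (answer_text, tool_call_trace)."""
    trace = []
    response = client.responses.create(input=question, **create_kwargs)
    for _ in range(max_iterations):  # safety cap on tool round-trips
        calls = [
            item for item in response.output
            if getattr(item, "type", None) == "function_call"
        ]
        if not calls:
            break  # no pending tool calls, so the answer text is final
        outputs = []
        for call in calls:
            executor = registry.get(call.name)
            if executor is None:
                result = {"error": f"unknown tool: {call.name}"}
            else:
                result = executor(**json.loads(call.arguments or "{}"))
            trace.append({"name": call.name, "arguments": call.arguments, "result": result})
            outputs.append({
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": json.dumps(result),
            })
        # Chain onto the previous response so the service can match call_ids.
        response = client.responses.create(
            input=outputs, previous_response_id=response.id, **create_kwargs
        )
    return response.output_text or "", trace


class FakeResponses:
    """Test double: returns one function_call first, then a final text answer."""

    def __init__(self):
        self.requests = []

    def create(self, **kwargs):
        self.requests.append(kwargs)
        if "previous_response_id" not in kwargs:
            call = SimpleNamespace(
                type="function_call",
                name="calculator",
                call_id="call_1",
                arguments='{"operation": "add", "a": 2, "b": 2}',  # hypothetical schema
            )
            return SimpleNamespace(id="resp_1", output_text="", output=[call])
        return SimpleNamespace(id="resp_2", output_text="2 + 2 = 4", output=[])


def fake_calculator(operation, a, b):
    """Hypothetical executor used only for this demo."""
    return {"result": a + b} if operation == "add" else {"error": "unsupported"}


fake_client = SimpleNamespace(responses=FakeResponses())
answer, trace = query_agent_with_tool_loop(
    fake_client, "What is 2 + 2?", {"calculator": fake_calculator}
)
```

If the cap is reached while the agent is still requesting tools, this sketch returns whatever `output_text` the last response carried, which may be empty.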

Also added a diagnostic log when an answer is still empty after the loop completes (prints the tool-call trace), so the next failure of this kind is actually debuggable.
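A diagnostic of this kind might look like the sketch below. The function name, logger name, and message format are assumptions; the PR only states that the tool-call trace is printed when the answer is still empty.

```python
import json
import logging

logger = logging.getLogger("run_evaluation")  # hypothetical logger name


def warn_if_empty_answer(question, answer, trace):
    """Log the tool-call trace when the loop ended without answer text.

    Returns True if a warning was logged, False otherwise.
    """
    if answer and answer.strip():
        return False
    logger.warning(
        "Empty answer for question %r after tool loop; tool-call trace: %s",
        question,
        json.dumps(trace, default=str),
    )
    return True


logged_for_empty = warn_if_empty_answer("What is 2 + 2?", "", [{"name": "calculator"}])
logged_for_answer = warn_if_empty_answer("What is 2 + 2?", "4", [])
```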

Files

src/scripts/run_evaluation.py (+103 / -13)

Verification

python -m py_compile src/scripts/run_evaluation.py passes
Real verification: CD will run on main after merge and exercise the tool loop against the deployed calculator tool

Out of scope

src/scripts/test_agent.py has the same gap but is a smoke test, not a CI gate — leaving it for a follow-up if you want
No fallback "skip empty responses with zero score" logic added — the tool loop should eliminate the empty case for the current dataset; if it doesn't, the diagnostic log will tell us exactly what's missing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR updates the real evaluation path so deployed Foundry agents that emit Responses API function calls can execute local tools before evaluator scoring.

Changes:

Adds a local function-tool registry and executor for the calculator tool.
Replaces single-shot agent queries with a function-call loop using function_call_output.
Adds diagnostic logging and stores tool-call traces when answers remain empty.
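For readers unfamiliar with the pattern, a registry and executor along these lines could look like the sketch below. The calculator here is a hypothetical stand-in with an assumed `operation`/`a`/`b` signature; per the PR description, the real registry resolves names to `agent.tools.<name>.execute_<name>`, whose signature is not shown on this page.

```python
import json


def execute_calculator(operation, a, b):
    """Hypothetical stand-in for the repository's calculator executor."""
    ops = {
        "add": lambda x, y: x + y,
        "subtract": lambda x, y: x - y,
        "multiply": lambda x, y: x * y,
        "divide": lambda x, y: x / y,
    }
    if operation not in ops:
        return {"error": f"unsupported operation: {operation}"}
    try:
        return {"result": ops[operation](a, b)}
    except ZeroDivisionError:
        return {"error": "division by zero"}


def build_function_tool_registry():
    """Map tool names (as the agent emits them) to local executors."""
    return {"calculator": execute_calculator}


def execute_function_tool(registry, name, arguments_json):
    """JSON args in, JSON output out; failures are reported as JSON too."""
    executor = registry.get(name)
    if executor is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json or "{}")
        return json.dumps(executor(**kwargs))
    except (TypeError, ValueError) as exc:
        # ValueError covers malformed JSON; TypeError covers bad argument names.
        return json.dumps({"error": str(exc)})


registry = build_function_tool_registry()
ok_output = execute_function_tool(registry, "calculator", '{"operation": "add", "a": 2, "b": 3}')
unknown_output = execute_function_tool(registry, "weather", "{}")
```

Returning errors as JSON instead of raising lets the agent see the failure in the `function_call_output` and keeps one bad call from aborting the whole evaluation run.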


The deployed agent has a custom `function` tool (calculator). When the Responses API hits a function tool, it returns `function_call` output items and waits for the CLIENT to execute the tool and submit results via `function_call_output` items. The previous eval script never did this, so `response.output_text` was empty for any tool-using question, which crashed `GroundednessEvaluator` with `Response string cannot be empty`.

Adds:
- `_build_function_tool_registry()`: name -> local executor map
- `_execute_function_tool()`: JSON args in, JSON output out
- `_query_agent_with_tool_loop()`: drives the Responses tool loop using `previous_response_id` chaining, capped at 5 iterations
- Diagnostic log when an answer is still empty after the loop

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 28, 2026 23:07

Copilot started reviewing on behalf of ericchansen May 28, 2026 23:07

Copilot AI reviewed May 28, 2026


Comment thread src/scripts/run_evaluation.py

Comment thread src/scripts/run_evaluation.py Outdated

ericchansen force-pushed the fix/eval-function-call-loop branch from 3f7908c to f51d8b6 Compare May 28, 2026 23:13

ericchansen merged commit 5431883 into main May 28, 2026
1 check passed

ericchansen deleted the fix/eval-function-call-loop branch May 28, 2026 23:15
