Fix _TaskNavigationEfficiencyEvaluator rejecting valid inputs from cloud Foundry/ACA runtime#46745
Draft
Fix _TaskNavigationEfficiencyEvaluator rejecting valid inputs from cloud Foundry/ACA runtime#46745
_TaskNavigationEfficiencyEvaluator rejecting valid inputs from cloud Foundry/ACA runtime#46745Conversation
…om cloud runtime Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/e644088f-cada-4fcf-a537-c3d6c26212eb Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/e644088f-cada-4fcf-a537-c3d6c26212eb Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Copilot
AI
changed the title
[WIP] Fix response validation for TaskNavigationEfficiencyEvaluator
Fix May 6, 2026
_TaskNavigationEfficiencyEvaluator rejecting valid inputs from cloud Foundry/ACA runtime
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The cloud Foundry/ACA evaluation runtime serializes list/object
dataMappingfields to JSON-encoded strings before invoking the Python evaluator entry-point, causing_TaskNavigationEfficiencyEvaluatorto always fail with'response' must be a list of messages.even when the dataset contains valid lists.Changes
_task_navigation_efficiency.py_maybe_json_decode(value, field_name)static method — transparently decodes JSON-string inputs; raisesEvaluationException(category=INVALID_VALUE)on malformed JSON_real_callto decoderesponseandground_truthbefore validation, making cloud-runtime and in-process call paths equivalent_do_evalto treat a 2-element[list, dict]as equivalent to a native(list, dict)tuple — required because JSON has no tuple type, soground_truth=(["a"], {...})round-trips to[["a"], {...}]_task_navigation_efficiency_validator.py_validate_tool_names_and_params(tool_names, parameters)helper_validate_ground_truthto accept the JSON round-tripped 2-element[list, dict]form alongside native tuples and plain string lists; all existing error messages and blame/category/target values preservedtests/unittests/test_task_navigation_efficiency_string_inputs.py(new)response+ground_truthaccepted and produce identical results to native inputsground_truthtriggers parameter matchingEvaluationException(category=INVALID_VALUE)with new message'response' must be a list of messages.error preservedWarning
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
pypi.org/home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/python /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/python /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/lib/python3.10/site-packages/pip/__pip-REDACTED__.py install --ignore-installed --no-user --prefix /tmp/pip-build-env-q0mabqjx/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i REDACTED -- setuptools>=40.8.0(dns block)/home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/pip pip install httpx -q(dns block)scanning-api.github.com/home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js(dns block)If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
Summary
_TaskNavigationEfficiencyEvaluatorfails in cloud Foundry / ACA evaluation runs with:even when the user's dataset rows contain perfectly valid
responselists andground_truthlists. (The published asset inAzure/azureml-assets— same code, with parameter names renamed toactions/expected_actions— exhibits the identical bug; a parallel PR is being opened against that repo.)Root cause
When this evaluator is invoked through the OpenAI‑Evals
dataMappingpipeline as acodeevaluator, the runtime delivers list/object fields to the Python entrypoint as JSON-encoded strings (e.g."[{\"role\":\"user\",...}]") instead of Python lists. This happens because the templating substitution${data.response}does not preserve container types for complex values; the runtime callsjson.dumpsbefore invoking the grader.The validator at:
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_validators/_task_navigation_efficiency_validator.pyrejects strings outright in
_validate_response:…and in
_validate_ground_truth, thelistbranch insists every element is astr, which also rejects the JSON-decoded tuple-form[[...names...], {...params...}](JSON has no tuple type, so the documented tuple form(list, dict)round-trips to a 2-element list).This contradicts (a) the evaluator's own typing —
response: Union[str, List[Dict[str, Any]]]in the__call__overloads at_task_navigation_efficiency.py— and (b) the behavior of peer evaluators (ToolCallAccuracy,TaskAdherence,IntentResolution, etc.) which already accept stringified inputs from the same runtime.Required changes
1.
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_navigation_efficiency/_task_navigation_efficiency.pyAdd a small JSON-decode helper and call it at the top of
_real_call, beforeself._validator.validate_eval_input(kwargs), so both the validator and_do_evalsee Python objects:_do_evalalready acceptsground_truthas either atupleor a 2-elementlist(see existingisinstance(ground_truth, tuple) and len(ground_truth) == 2and thelistbranch). No changes are required to_do_evalitself; please verify the existing logic still works after the fix.2.
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_validators/_task_navigation_efficiency_validator.pyUpdate
_validate_ground_truthso that, in addition to the existingtupleand "list of strings" branches, a 2-element list/tuple whose first element is alistand whose second element is adictis treated as the tuple-form(tool_names_list, parameters_dict)and validated using the same rules as the existing tuple branch.Refactor the existing tuple-validation logic into a private helper (e.g.
_validate_tool_names_and_params(self, tool_names, parameters) -> Optional[EvaluationException]) so it is shared between thetuplepath and the new "2-element list" path. Keep all existing user-facing error messages andErrorBlame/ErrorCategory/targetvalues for cases already covered. The fall-through "must be either a list of tool names or a tuple of (tool_names_list, parameters_dict)" error must still be raised for genuinely invalid shapes.The decision tree should be roughly:
ground_truth is Noneor empty → existing missing-field error.ground_truthistuple→ ex...This pull request was created from Copilot chat.