Improve CSAT evaluator prompt to better detect explicit user dissatisfaction by imatiach-msft · Pull Request #4998 · Azure/azureml-assets

imatiach-msft · 2026-05-05T23:30:59Z

Summary

Fixes Bug #5243079 - CSAT evaluator scores 3 (Neutral) when user explicitly expresses dissatisfaction, instead of scoring 2 (Dissatisfied).

Problem

When the CSAT evaluator receives a conversation where the user explicitly states the agent response was unhelpful, the model gives too much credit for the agent polite tone and alternative suggestions, inflating the score to 3 instead of reflecting the user actual dissatisfaction.

This happens specifically when the conversation is passed as flattened text in the query field (which is how the production pipeline calls the single-turn evaluator).

Changes

Added two rules to the IMPORTANT CONSIDERATIONS section of both customer_satisfaction.prompty and customer_satisfaction_multi_turn.prompty:

Explicit user dissatisfaction signals: If the user explicitly expresses dissatisfaction (e.g. that didnt help), the score MUST be 1 or 2 regardless of agent tone.
Unresolved core requests: If the agent cannot fulfill the users primary request and only offers workarounds, the score should not exceed 3.

Test Results (gpt-5.2, temperature=0, 5 runs each)

With DSAT present in query (text format) - the bug scenario

Format	Original Prompt	Modified Prompt
query + User follow-up: that didnt help!	3,3,3,3,3	2,2,2,2,2
Turn 1 - User: hi, Turn 2 - ...	3,3,3,3,3	2,2,2,2,2
query + that didnt help!	3,3,3,3,3	2,2,2,2,2

Without DSAT (query = just user question, no follow-up)

Scenario	Original Prompt	Modified Prompt
whats the weather in New York? only	3,4,4,4,4	3,3,3,3,3

The unresolved core requests rule prevents inflated scores (4) when the agent could not fulfill the request.

With full JSON conversation (structured format)

Scenario	Original Prompt	Modified Prompt
Full JSON array as query (with DSAT)	2,2,2,2,2	2,2,2,2,2

When conversation is passed as structured JSON, the model already handles DSAT correctly - no regression.

Test Conversation (from production trace)

User: hi
Assistant: Hi - what can I help you with?
User: whats the weather in New York?
Assistant: I cant see live weather data from here... [offers alternatives]
User: that didnt help!

Production scored this as 3 (Neutral). With our change, it correctly scores 2 (Dissatisfied).

github-actions · 2026-05-05T23:31:57Z

Test Results for assets-test

68 tests 68 ✅ 2s ⏱️
1 suites 0 💤
1 files 0 ❌

Results for commit 36b60ea.

♻️ This comment has been updated with latest results.

…faction Add explicit rules to both single-turn and multi-turn CSAT evaluator prompts for handling user dissatisfaction signals: - Explicit DSAT signals (e.g. 'that didn't help!') MUST score 1 or 2 - Unresolved core requests cap at score 3 even with good tone - Multi-turn: trailing user messages without agent response scored accordingly - Professional tone does not compensate for confirmed user dissatisfaction Tested with gpt-5.2 judge model: - Multi-turn WITH DSAT: Score 2 (correctly detects dissatisfaction) - Multi-turn WITHOUT DSAT: Score 3 (neutral, unresolved) - Single-turn baseline (no DSAT): Score 4 (unchanged for happy path) Bug #5243079 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

imatiach-msft requested review from a team as code owners May 5, 2026 23:31

imatiach-msft temporarily deployed to Testing May 5, 2026 23:31 — with GitHub Actions Inactive

imatiach-msft temporarily deployed to Testing May 5, 2026 23:33 — with GitHub Actions Inactive

imatiach-msft force-pushed the ilmat/csat-dsat-detection branch from 1fc854f to 36b60ea Compare May 5, 2026 23:35

imatiach-msft temporarily deployed to Testing May 5, 2026 23:36 — with GitHub Actions Inactive

imatiach-msft temporarily deployed to Testing May 5, 2026 23:38 — with GitHub Actions Inactive

vebudumu approved these changes May 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve CSAT evaluator prompt to better detect explicit user dissatisfaction#4998

Improve CSAT evaluator prompt to better detect explicit user dissatisfaction#4998
imatiach-msft wants to merge 1 commit intomainfrom
ilmat/csat-dsat-detection

imatiach-msft commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

imatiach-msft commented May 5, 2026

Summary

Problem

Changes

Test Results (gpt-5.2, temperature=0, 5 runs each)

With DSAT present in query (text format) - the bug scenario

Without DSAT (query = just user question, no follow-up)

With full JSON conversation (structured format)

Test Conversation (from production trace)

Uh oh!

github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results for assets-test

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented May 5, 2026 •

edited

Loading