Skip to content

Improve CSAT evaluator prompt to better detect explicit user dissatisfaction#4998

Open
imatiach-msft wants to merge 1 commit intomainfrom
ilmat/csat-dsat-detection
Open

Improve CSAT evaluator prompt to better detect explicit user dissatisfaction#4998
imatiach-msft wants to merge 1 commit intomainfrom
ilmat/csat-dsat-detection

Conversation

@imatiach-msft
Copy link
Copy Markdown
Contributor

Summary

Fixes Bug #5243079 - CSAT evaluator scores 3 (Neutral) when user explicitly expresses dissatisfaction, instead of scoring 2 (Dissatisfied).

Problem

When the CSAT evaluator receives a conversation where the user explicitly states the agent response was unhelpful, the model gives too much credit for the agent polite tone and alternative suggestions, inflating the score to 3 instead of reflecting the user actual dissatisfaction.

This happens specifically when the conversation is passed as flattened text in the query field (which is how the production pipeline calls the single-turn evaluator).

Changes

Added two rules to the IMPORTANT CONSIDERATIONS section of both customer_satisfaction.prompty and customer_satisfaction_multi_turn.prompty:

  1. Explicit user dissatisfaction signals: If the user explicitly expresses dissatisfaction (e.g. that didnt help), the score MUST be 1 or 2 regardless of agent tone.
  2. Unresolved core requests: If the agent cannot fulfill the users primary request and only offers workarounds, the score should not exceed 3.

Test Results (gpt-5.2, temperature=0, 5 runs each)

With DSAT present in query (text format) - the bug scenario

Format Original Prompt Modified Prompt
query + User follow-up: that didnt help! 3,3,3,3,3 2,2,2,2,2
Turn 1 - User: hi, Turn 2 - ... 3,3,3,3,3 2,2,2,2,2
query + that didnt help! 3,3,3,3,3 2,2,2,2,2

Without DSAT (query = just user question, no follow-up)

Scenario Original Prompt Modified Prompt
whats the weather in New York? only 3,4,4,4,4 3,3,3,3,3

The unresolved core requests rule prevents inflated scores (4) when the agent could not fulfill the request.

With full JSON conversation (structured format)

Scenario Original Prompt Modified Prompt
Full JSON array as query (with DSAT) 2,2,2,2,2 2,2,2,2,2

When conversation is passed as structured JSON, the model already handles DSAT correctly - no regression.

Test Conversation (from production trace)

User: hi
Assistant: Hi - what can I help you with?
User: whats the weather in New York?
Assistant: I cant see live weather data from here... [offers alternatives]
User: that didnt help!

Production scored this as 3 (Neutral). With our change, it correctly scores 2 (Dissatisfied).

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

Test Results for assets-test

68 tests   68 ✅  2s ⏱️
 1 suites   0 💤
 1 files     0 ❌

Results for commit 36b60ea.

♻️ This comment has been updated with latest results.

…faction

Add explicit rules to both single-turn and multi-turn CSAT evaluator prompts
for handling user dissatisfaction signals:

- Explicit DSAT signals (e.g. 'that didn't help!') MUST score 1 or 2
- Unresolved core requests cap at score 3 even with good tone
- Multi-turn: trailing user messages without agent response scored accordingly
- Professional tone does not compensate for confirmed user dissatisfaction

Tested with gpt-5.2 judge model:
- Multi-turn WITH DSAT: Score 2 (correctly detects dissatisfaction)
- Multi-turn WITHOUT DSAT: Score 3 (neutral, unresolved)
- Single-turn baseline (no DSAT): Score 4 (unchanged for happy path)

Bug #5243079

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants