[GA Day 1] Unrecoverable model error loop, agent misidentifies own UI, no self-diagnostic capability #98

@ericchansen

Summary

I encountered three bugs on Day 1 of Azure SRE Agent GA (March 11, 2026). I created the agent "sre-sandbox" in eastus2, connected 4 GitHub repos via OAuth, and exercised the new onboarding flow. The agent performed an impressive initial investigation but then entered an unrecoverable error state.

Environment

| Component | Value |
|---|---|
| Agent name | sre-sandbox |
| Agent type | GA (created March 11, 2026) |
| Region | eastus2 |
| Connected repos | 4 repos via OAuth |
| Logs connector | None (using built-in Azure Monitor) |
| Browser | Microsoft Edge on Windows 11 |

Bug 1: Unrecoverable "Internal Model Connection Error" Loop (High Severity)

After a successful onboarding conversation and autonomous investigation, the agent entered an error state in which every subsequent user message triggered "a temporary AI model connection error." The agent never recovered, even after multiple retries over ~10 minutes.

Steps to Reproduce

  1. Created a new GA SRE Agent at sre.azure.com
  2. Connected 4 GitHub repos via OAuth, configured resource group access
  3. Agent initiated onboarding: read READMEs, created memory files (team.md, overview.md), ran Azure CLI commands (az containerapp list, az containerapp show), and produced a detailed investigation report
  4. User sent follow-up messages asking about cold start behavior and resource group cleanup
  5. Every subsequent agent response failed with "temporary AI model connection error"
  6. Auto-retry mechanism triggered 3 times — all failed
  7. Thread became permanently stuck; no further agent responses possible

Expected Behavior

  • Agent should recover from transient model connection errors via retry/backoff
  • If the backing model is unavailable, provide a clear status message
  • The conversation thread should remain functional after recovery
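
To make the expected retry/backoff behavior concrete, here is a minimal sketch of what I would expect to happen around the model call. It is not the agent's actual implementation, which I can't see. `call_with_backoff`, `ModelUnavailableError`, and every parameter name are hypothetical, and I'm assuming the transient failure surfaces as a Python `ConnectionError`.

```python
import random
import time


class ModelUnavailableError(Exception):
    """Hypothetical error raised once every retry has failed."""


def call_with_backoff(call_model, max_attempts=4, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Retry call_model() on ConnectionError with exponential backoff and jitter."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return call_model()
        except ConnectionError as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries so clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    # Surface the underlying cause instead of a generic message.
    raise ModelUnavailableError(
        f"model backend unavailable after {max_attempts} attempts: {last_error!r}"
    ) from last_error
```

The important part for this bug is the final `raise`: when retries are exhausted, the caller gets a specific error it can show to the user, and the thread itself is left in a state where the next message can try again.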

Actual Behavior

  • Agent entered a permanent error loop with no recovery
  • User had to abandon the thread entirely and start a new one
  • No error details exposed — just a generic "temporary AI model connection error"
  • The auto-retry messages also all failed

Timestamps (PT)


Bug 2: Agent Misidentifies Its Own UI as a Customer App (Medium Severity)

When the user shared a screenshot of the broken chat thread (Bug 1) in a new conversation, the agent misidentified its own portal UI as the "Itemwise app" and attempted to investigate a non-existent application error.

Steps to Reproduce

  1. Take a screenshot of the broken SRE Agent chat (showing the error loop from Bug 1)
  2. Start a new chat thread in the same SRE Agent
  3. Send the screenshot to the agent

Expected Behavior

  • Agent should recognize its own chat UI in the screenshot
  • Or at minimum, not confidently misidentify it as a specific monitored application

Actual Behavior

  • Agent responded: "The screenshot shows the Itemwise app returning an error"
  • Agent read its overview.md memory file and started investigating Itemwise's Container Apps deployment
  • User had to explicitly correct the agent: "No, that's not the itemwise app. That's you! That's another chat thread with you"

Bug 3: No Self-Diagnostic Capability (Low Severity / Feature Request)

After correction, the agent still could not extract meaningful diagnostic information. It acknowledged seeing "another Azure SRE Agent chat thread" but could not read the error text or identify the failure mode.

User: "Based on the image, can you ask yourself what was wrong with Azure SRE Agent?"

Agent: "I'll be honest — I can't read all the fine text in the screenshot clearly enough to pinpoint the exact error or behavior."

Suggestion

Expose a self-diagnostic skill or command (e.g., /diagnose) that allows the agent to query its own conversation history, check model availability/health, and access its own Application Insights telemetry — rather than relying on screenshot OCR.
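
As a rough illustration of what such a command might return, here is a sketch that inspects the conversation history directly instead of reading a screenshot. Everything in it is hypothetical: `Turn`, `DiagnosticReport`, `diagnose`, and the `check_model_health` callback are names I made up, not part of any existing Azure SRE Agent API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    role: str                    # "user" or "agent"
    error: Optional[str] = None  # error text if this turn failed


@dataclass
class DiagnosticReport:
    model_healthy: bool          # result of the health probe
    trailing_failures: int       # consecutive failed agent turns at the end of the thread
    last_error: Optional[str]    # most recent agent error text, if any


def diagnose(history: List[Turn],
             check_model_health: Callable[[], bool]) -> DiagnosticReport:
    """Summarize the failure state at the end of a conversation thread."""
    trailing = 0
    last_error = None
    # Walk backwards over agent turns until the first successful one.
    for turn in reversed(history):
        if turn.role != "agent":
            continue
        if turn.error is None:
            break
        trailing += 1
        if last_error is None:
            last_error = turn.error  # first hit while reversed is the most recent
    return DiagnosticReport(check_model_health(), trailing, last_error)
```

For the thread in Bug 1, a report like this would show a run of failed agent turns with the same error text plus the current model health, which is the information the agent could not recover from the image.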


What Worked Well

The agent's first-run experience was impressive before the error loop:

  • Deep Context: autonomously read READMEs from all 4 connected repos, created structured memory files
  • Investigation: found a real ACR auth issue (34K+ ImagePullBackOff retries) via az containerapp commands
  • Memory system: created accurate team.md and overview.md from the onboarding conversation
  • Onboarding flow: natural and productive "get to know you" experience
