[GA Day 1] Unrecoverable model error loop, agent misidentifies own UI, no self-diagnostic capability #98

## Summary

This report covers three bugs encountered on **Day 1 of Azure SRE Agent GA** (March 11, 2026). The setup: agent "sre-sandbox" created in eastus2, 4 GitHub repos connected via OAuth, and the new onboarding flow exercised. The agent performed an impressive initial investigation but then entered an unrecoverable error state.

## Environment

| Component | Value |
|-----------|-------|
| Agent name | sre-sandbox |
| Agent type | GA (created March 11, 2026) |
| Region | eastus2 |
| Connected repos | 4 repos via OAuth |
| Logs connector | None (using built-in Azure Monitor) |
| Browser | Microsoft Edge on Windows 11 |

---

## Bug 1: Unrecoverable "Internal Model Connection Error" Loop (High Severity)

After a successful onboarding conversation and autonomous investigation, the agent entered an error state in which every subsequent user message triggered "a temporary AI model connection error." The agent never recovered, even after multiple retries over ~10 minutes.

### Steps to Reproduce
1. Created a new GA SRE Agent at sre.azure.com
2. Connected 4 GitHub repos via OAuth, configured resource group access
3. Agent initiated onboarding: read READMEs, created memory files (`team.md`, `overview.md`), ran Azure CLI commands (`az containerapp list`, `az containerapp show`), and produced a detailed investigation report
4. User sent follow-up messages asking about cold start behavior and resource group cleanup
5. **Every subsequent agent response failed** with "temporary AI model connection error"
6. Auto-retry mechanism triggered 3 times — all failed
7. Thread became permanently stuck; no further agent responses possible

### Expected Behavior
- Agent should recover from transient model connection errors via retry/backoff
- If the backing model is unavailable, provide a clear status message
- The conversation thread should remain functional after recovery
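The retry/backoff behavior described above could look roughly like the following. This is a minimal sketch, not the agent's actual implementation: `call_with_backoff`, `ModelUnavailableError`, and all parameter defaults are hypothetical names and values chosen for illustration, and it assumes the transient failure surfaces as a `ConnectionError`.

```python
import random
import time


class ModelUnavailableError(Exception):
    """Hypothetical error raised once every retry has been exhausted."""


def call_with_backoff(call, max_attempts=4, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Retry `call` on ConnectionError using capped exponential backoff with full jitter."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            # The delay ceiling doubles each attempt and is capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random fraction of the ceiling.
            sleep(ceiling * rng())
    # Surface a clear status message instead of a generic error.
    raise ModelUnavailableError(
        f"Model backend unreachable after {max_attempts} attempts: {last_error}"
    ) from last_error
```

The point of the sketch is the final `raise`: once retries are exhausted, the caller gets a specific, actionable status rather than another generic message, and the thread itself is not left in a broken state.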

### Actual Behavior
- Agent entered a permanent error loop with no recovery
- User had to abandon the thread entirely and start a new one
- No error details exposed — just a generic "temporary AI model connection error"
- All auto-retry attempts also failed

### Timestamps (PT)
- ~1:18 PM: Successful onboarding + investigation completed
- ~1:24 PM: First user follow-up message → "internal error"
- ~1:25 PM: Auto-retry #1 → failed
- ~1:25 PM: Auto-retry #2 → failed
- ~1:33 PM: Auto-retry #3 → failed
- Thread permanently stuck after this point
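For reference, the gaps between these events can be computed as below. The times are the approximate values from the list above, so the results are only accurate to the minute; the variable and function names are illustrative.

```python
from datetime import datetime

# Approximate times copied from the list above (PT, March 11, 2026).
events = [
    ("investigation completed", "1:18 PM"),
    ("first follow-up failed", "1:24 PM"),
    ("auto-retry #1 failed", "1:25 PM"),
    ("auto-retry #2 failed", "1:25 PM"),
    ("auto-retry #3 failed", "1:33 PM"),
]


def minutes_between(events):
    """Return the whole-minute gap between each pair of consecutive events."""
    times = [datetime.strptime(t, "%I:%M %p") for _, t in events]
    return [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]


gaps = minutes_between(events)
# Minutes from the first failure to the last failed auto-retry.
failure_window = sum(gaps[1:])

print(gaps)            # [6, 1, 0, 8]
print(failure_window)  # 9
```

The failure window of about 9 minutes matches the "~10 minutes" noted in the bug description. The retry spacing (roughly 1, 0, and 8 minutes) is irregular, though with minute-level timestamps this says little about the retry schedule the service actually uses.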

---

## Bug 2: Agent Misidentifies Its Own UI as a Customer App (Medium Severity)

When the user shared a screenshot of the broken chat thread (Bug 1) in a new conversation, the agent misidentified its own portal UI as the "Itemwise app" and attempted to investigate a non-existent application error.

### Steps to Reproduce
1. Take a screenshot of the broken SRE Agent chat (showing the error loop from Bug 1)
2. Start a new chat thread in the same SRE Agent
3. Send the screenshot to the agent

### Expected Behavior
- Agent should recognize its own chat UI in the screenshot
- At minimum, the agent should not confidently misidentify it as a specific monitored application

### Actual Behavior
- Agent responded: "The screenshot shows the Itemwise app returning an error"
- Agent read its `overview.md` memory file and started investigating Itemwise's Container Apps deployment
- User had to explicitly correct the agent: "No, that's not the itemwise app. That's you! That's another chat thread with you"

---

## Bug 3: No Self-Diagnostic Capability (Low Severity / Feature Request)

After being corrected, the agent still could not extract meaningful diagnostic information from the screenshot. It acknowledged seeing "another Azure SRE Agent chat thread" but could not read the error text or identify the failure mode.

**User:** "Based on the image, can you ask yourself what was wrong with Azure SRE Agent?"

**Agent:** "I'll be honest — I can't read all the fine text in the screenshot clearly enough to pinpoint the exact error or behavior."

### Suggestion
Expose a self-diagnostic skill or command (e.g., `/diagnose`) that allows the agent to query its own conversation history, check model availability/health, and access its own Application Insights telemetry — rather than relying on screenshot OCR.
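To make the suggestion concrete, here is a rough sketch of the conversation-history part of such a command. Everything in it is hypothetical: `TurnRecord`, `DiagnosticReport`, `diagnose`, and the `stuck_threshold` default are invented for illustration and do not correspond to any existing SRE Agent API. It also leaves out the model health and Application Insights checks, which would depend on internals not visible from the outside.

```python
from dataclasses import dataclass


@dataclass
class TurnRecord:
    """One turn in a conversation thread (hypothetical shape)."""
    role: str        # "user" or "agent"
    ok: bool         # False if the turn ended in an error
    error: str = ""  # error text, if any


@dataclass
class DiagnosticReport:
    agent_turns: int
    failed_turns: int
    consecutive_failures: int
    last_error: str
    stuck: bool


def diagnose(history, stuck_threshold=3):
    """Summarize agent failures in a thread and flag a likely error loop."""
    agent_turns = [t for t in history if t.role == "agent"]
    failed = [t for t in agent_turns if not t.ok]
    # Count failures at the tail of the thread, stopping at the last success.
    trailing = 0
    for turn in reversed(agent_turns):
        if turn.ok:
            break
        trailing += 1
    return DiagnosticReport(
        agent_turns=len(agent_turns),
        failed_turns=len(failed),
        consecutive_failures=trailing,
        last_error=failed[-1].error if failed else "",
        stuck=trailing >= stuck_threshold,
    )
```

Run against a thread shaped like Bug 1 (one successful agent turn, then a failed response and three failed auto-retries), this would report four consecutive failures and `stuck=True`, which is the kind of answer the screenshot-based approach could not produce.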

---

## What Worked Well

The agent's first-run experience was impressive before the error loop:
- **Deep context**: autonomously read READMEs from all 4 connected repos, created structured memory files
- **Investigation**: found a real ACR auth issue (34K+ ImagePullBackOff retries) via `az containerapp` commands
- **Memory system**: created accurate `team.md` and `overview.md` from the onboarding conversation
- **Onboarding flow**: natural and productive "get to know you" experience

