Azure SRE Agent template, tools, schedules, and docs #2111

Draft: MSBrett wants to merge 106 commits into dev from features/sre

MSBrett commented Apr 28, 2026

Description

Adds a redistributable FinOps SRE Agent template and supporting agent/plugin assets for the FinOps Toolkit. This PR now reflects the current branch state, not the earlier public-preview prototype.

The SRE Agent template lives under src/templates/sre-agent/ and provides a customer-facing deployment path for a FinOps-focused Azure SRE Agent that can inspect FinOps hubs, cost data, capacity and quota risk, budgets, alerts, Advisor suppressions, and executive reporting workflows.

What ships

FinOps SRE Agent template

  • New src/templates/sre-agent/ template with README, catalog, deployment scripts, Bicep infra, recipe assets, CI example, verification script, and AGENTS guardrails.
  • Canonical deployment flow adapted from the Microsoft SRE Agent starter lab pattern: Azure CLI + Bicep for infrastructure, then srectl for post-provisioned agent configuration.
  • No azd dependency. The deployment uses explicit parameters for subscription, resource group, agent name, and location.
  • Subscription-scoped Bicep creates the agent resource group, Log Analytics workspace, Application Insights component, and Microsoft.App/agents resource.
  • Deterministic deployment/resource naming based on subscription ID, resource group ID, and agent name so updates target the same resources instead of creating duplicates.
  • System-managed identity is used across the agent configuration; the deployment grants the agent access to its own managed resource group.
  • Optional FinOps Hub Kusto integration via --cluster-uri and --cluster-resource-id. When supplied for a same-tenant cluster, the template creates the Kusto connector and applies AllDatabasesViewer on the ADX cluster. When omitted, the agent still deploys with all non-Kusto configuration.
  • Network limitation handling for private ADX/Kusto: deployment warns and links customers to the SRE Agent private-network limitation docs instead of skipping required resources.
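
To illustrate why deterministic naming keeps redeployments pointed at the same resources, here is a minimal Python sketch. It is not the template's Bicep logic: the helper name `deterministic_suffix`, the SHA-256 hash, the 13-character length, and the placeholder IDs are all assumptions made for illustration.

```python
import hashlib


def deterministic_suffix(subscription_id: str, resource_group_id: str,
                         agent_name: str, length: int = 13) -> str:
    """Derive a stable suffix from the three inputs named in the description."""
    # Normalize case and whitespace so the same logical inputs hash identically.
    parts = (subscription_id, resource_group_id, agent_name)
    key = "|".join(part.strip().lower() for part in parts)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:length]


# Placeholder values, not real identifiers.
sub = "00000000-0000-0000-0000-000000000000"
rg_id = f"/subscriptions/{sub}/resourceGroups/rg-example"

first = deterministic_suffix(sub, rg_id, "ftk-sre-example")
second = deterministic_suffix(sub, rg_id, "ftk-sre-example")
other = deterministic_suffix(sub, rg_id, "ftk-sre-other")

print(first == second)  # a rerun with the same inputs targets the same name
print(first != other)   # a different agent name yields a different name
```

Because the suffix depends only on the inputs, rerunning a deployment with unchanged parameters resolves to the same names, which is the property that avoids duplicates.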

Agent topology and ownership model

  • 5 custom agents:
    • finops-practitioner owns all scheduled tasks and orchestrates work.
    • ftk-database-query owns FinOps Hub Kusto/cost query tools.
    • azure-capacity-manager owns capacity, quota, SKU availability, database quota, and benefit recommendation Python tools.
    • ftk-hubs-agent owns FinOps Hub infrastructure, Resource Graph, data freshness, budget/alert deployment, and Advisor suppression tools.
    • chief-financial-officer provides executive finance framing and has no operational tools.
  • Only finops-practitioner has handoffs, and it delegates to the four subagents. Subagents do not have subagents.
  • Practitioner and CFO are knowledge-only (SearchMemory), not operational tool-bearing agents.
  • Tool-bearing agents have explicit tool ownership to keep Kusto, capacity, infrastructure, visualization, and delivery responsibilities separated.
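
The handoff rules above can be checked mechanically. The following sketch assumes a hypothetical in-memory mapping of agent name to handoff targets; the real agent YAML schema is not shown in this description, so `HANDOFFS` and `topology_errors` are illustrative names only.

```python
ORCHESTRATOR = "finops-practitioner"

# Hypothetical view of the topology: agent name -> agents it hands off to.
HANDOFFS = {
    "finops-practitioner": [
        "ftk-database-query",
        "azure-capacity-manager",
        "ftk-hubs-agent",
        "chief-financial-officer",
    ],
    "ftk-database-query": [],
    "azure-capacity-manager": [],
    "ftk-hubs-agent": [],
    "chief-financial-officer": [],
}


def topology_errors(handoffs, orchestrator=ORCHESTRATOR):
    """Return a list of violations of the one-level delegation model."""
    errors = []
    # Subagents must not have subagents of their own.
    for agent, targets in handoffs.items():
        if agent != orchestrator and targets:
            errors.append(f"{agent} must not declare handoffs")
    # The orchestrator must delegate to every other agent.
    delegated = set(handoffs.get(orchestrator, []))
    missing = set(handoffs) - {orchestrator} - delegated
    for agent in sorted(missing):
        errors.append(f"{orchestrator} does not delegate to {agent}")
    return errors


print(topology_errors(HANDOFFS))
```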

Tools, skills, and knowledge

  • 34 custom tools:
    • 21 Kusto tools backed by the FinOps Toolkit query catalog.
    • 13 Python tools for capacity, quota, Resource Graph, budgets, anomaly alerts, Advisor suppression, and data freshness diagnostics.
  • 9 built-in platform tools enabled for log query and visualization support.
  • 3 skills packaged with the recipe:
    • finops-toolkit
    • azure-cost-management
    • azure-capacity-management
  • 6 KnowledgeFile sources uploaded and verified by post-provisioning:
    • chart-artifact-verification.md
    • document-index.md
    • ftk-output-style.md
    • known-issues-and-workarounds.md
    • onboarding-recommendations.md
    • teams-notification-guide.md
  • Shared output style from src/templates/claude-plugin/output-styles/ftk-output-style.md is uploaded into the SRE Agent knowledge base and referenced by every scheduled task.

Scheduled tasks

Ships 19 scheduled tasks, all owned by finops-practitioner:

  • AdvisorSuppressionReview
  • AIWorkloadCostAnalysis
  • AlertCoverageAudit
  • BenefitRecommendationReview
  • BudgetCoverageAudit
  • CapacityDailyMonitor
  • CapacityMonthlyPlanning
  • CapacityQuarterlyStrategy
  • CapacityWeeklySupplyReview
  • ComputeUtilizationTrend
  • CostOptimization
  • DbQuotaAudit
  • HubsHealthCheck
  • MonitoringScopeValidation
  • Monthly
  • NonComputeQuotaAudit
  • SkuAvailabilityAudit
  • StoragePaasGrowthForecast
  • Semiannual

All task prompts include the routing guard, output-style requirement, subagent ownership rules, visualization guidance, and Teams/Outlook delivery instructions. Monthly runs on the 5th of each month. Semiannual runs on January 5 and July 5.
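
As a rough illustration of the two cadences, the sketch below encodes them as five-field cron expressions and checks which dates match. The 09:00 run time is an assumption (the description does not state a time of day), and the matcher is deliberately simplified: it handles only `*`, single numbers, and comma lists, and it ignores the minute, hour, and day-of-week fields.

```python
from datetime import date

# Assumed expressions: minute hour day-of-month month day-of-week.
MONTHLY = "0 9 5 * *"        # the 5th of every month
SEMIANNUAL = "0 9 5 1,7 *"   # January 5 and July 5


def _field_matches(field: str, value: int) -> bool:
    """Match '*', a single number, or a comma-separated list of numbers."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def runs_on(cron_expression: str, day: date) -> bool:
    """Check only the day-of-month and month fields against a date."""
    _minute, _hour, day_of_month, month, _day_of_week = cron_expression.split()
    return _field_matches(day_of_month, day.day) and _field_matches(month, day.month)


print(runs_on(MONTHLY, date(2026, 3, 5)))
print(runs_on(SEMIANNUAL, date(2026, 3, 5)))
print(runs_on(SEMIANNUAL, date(2026, 7, 5)))
```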

FinOps Framework alignment

  • src/templates/sre-agent/CATALOG.md maps the SRE Agent configuration to the canonical FinOps Framework domains, capabilities, principles, and personas.
  • Capacity management content is framed as implementation detail inside the FinOps Framework rather than as a separate replacement framework.
  • Agent, schedule, and tool ownership are aligned to FinOps practitioner-led workflows with finance/CFO review and specialist operational evidence gathering.

FinOps Toolkit query catalog

  • Adds 21 KQL catalog queries under src/queries/catalog/ and wires them into SRE Agent Kusto tools.
  • Adds supporting FinOps Hub database/query references for the packaged finops-toolkit skill.
  • The SRE Agent recipe follows symlinked/canonical query assets rather than duplicating query definitions by hand.

Claude and Copilot plugin assets

  • Adds/updates Claude plugin agent definitions, commands, and the shared ftk-output-style.md output style.
  • Adds Copilot plugin packaging and agent/command assets that mirror the FinOps practitioner, database query, hubs, and CFO roles.
  • Updates plugin guidance so the FinOps practitioner leads the work and delegates operational evidence collection to the correct specialists.

Documentation

  • Adds Microsoft Learn content under docs-mslearn/toolkit/sre-agent/ covering overview, deployment, template contents, agents, tools, Kusto tools, Python tools, scheduled tasks, knowledge, security, and troubleshooting.
  • Adds docs-mslearn/toolkit/hubs/configure-sre.md and TOC/redirection updates.
  • Updates related FinOps Toolkit docs, framework pages, hub docs, optimization-engine docs, agent skills, and changelog entries to account for the new SRE Agent template and FinOps Framework alignment.

Deployment and validation evidence

Current branch head: 34b0ed2afb2f6cd6142deec6872ebf16e7e3c162.

Validated in a new subscription on 2026-05-28 with no FinOps Hub deployed in that tenant:

bash bin/deploy.sh \
  --recipe recipes/finops-hub \
  --subscription 5763a052-fb17-4b4b-bf10-e56d8eb6f36b \
  --resource-group rg-ftk-sre-5763-uat \
  --name ftk-sre-5763-uat \
  --location eastus2

Result:

  • ARM deployment succeeded.
  • Agent resource, Log Analytics, and Application Insights landed in the target RG.
  • The agent's system-managed identity (SMID) received Contributor, Reader, Monitoring Reader, and Log Analytics Reader on the agent RG.
  • 5 agents landed with the expected handoff topology.
  • 34 custom tools landed.
  • 9 platform log-query/visualization tools were enabled.
  • 3 skills landed.
  • 19 scheduled tasks landed with no duplicates; all owned by finops-practitioner.
  • 6 knowledge sources landed and indexed, including ftk-output-style.md.
  • Kusto connector was intentionally absent because there was no same-tenant FinOps Hub/ADX cluster in the target subscription.
  • The test RG was deleted after validation.

The earlier cross-tenant Hub attempt failed with CrossTenantDeploymentNotPermitted, which is expected: the target subscription was in a different tenant from the existing test ADX cluster, and SMID/UMID access cannot cross that tenant boundary for this deployment. Customers must deploy against a same-tenant Hub/ADX cluster or omit Kusto parameters until the Hub exists.

Review focus

Please focus review on:

  • src/templates/sre-agent/bin/deploy.sh, post-provision.sh, and verify-agent.sh
  • src/templates/sre-agent/infra/**
  • src/templates/sre-agent/recipes/finops-hub/**
  • src/templates/sre-agent/CATALOG.md and README.md
  • docs-mslearn/toolkit/sre-agent/**
  • src/templates/claude-plugin/** and src/templates/copilot-plugin/**
  • src/queries/catalog/**

Not in scope

  • This PR does not deploy a FinOps Hub into every target subscription.
  • This PR does not make cross-tenant ADX/Kusto access work. Same-tenant Hub/Kusto is required for the Kusto connector and ADX role assignment.
  • This PR does not configure customer Teams or Outlook connectors; scheduled tasks use those connectors when configured and otherwise render results in the run output.

Checklist

How did you test this change?

  • Lint / static checks where applicable
  • Bicep/ARM deployment path exercised with Azure CLI
  • Manual live deployment and post-provision validation
  • SRE Agent data-plane validation for agents, tools, skills, schedules, RBAC, and knowledge sources
  • No-Hub deployment path validated
  • Same-tenant Hub/Kusto deployment path needs final customer-environment validation

Deploy to test?

  • SRE Agent only / no Hub
  • SRE Agent + FinOps Hub ADX same-tenant
  • Hubs + ADX managed
  • Hubs + Fabric manual
  • Workbooks
  • Alerts

Documentation

  • Microsoft Learn docs added/updated
  • Template README/catalog/docs added
  • Changelog updated
  • Plugin docs/assets updated

…, UAT-validated scheduled tasks

- Add deploy.sh and deploy.ps1 as single packaged deployment entrypoints
  - --clone-env, --replace, --destroy modes
  - B2B tenant fix: az account set before azd up
  - Clone-env identity leak fix (unset AZURE_ENV_NAME/RG after sourcing)
- Switch agent action mode from Review to Autonomous
  - Review mode blocked all autonomous tool execution including Teams posts
  - Autonomous mode required for scheduled task delivery without human approval
- All 9 scheduled tasks updated with:
  - Phase 0: read knowledge base before starting work
  - Teams/knowledge split: financial results to Teams only, operational learnings to knowledge only
  - PostTeamsMessage delivery instruction in every task
- Fix duplicate task bug: post-provision uses srectl scheduledtask apply (idempotent)
- Add 3 knowledge docs for agent self-improvement:
  - onboarding-recommendations.md
  - teams-notification-guide.md (PostTeamsMessage vs 403 on Graph API)
  - known-issues-and-workarounds.md (7 error categories from UAT)
- Enable experimentalSettings (EnableV2AgentLoop, EnableWorkspaceTools)
- Enable execute_python on analytical subagents
- README updated as living product contract
- 17 tests passing (TC-4.1a, TC-4.2a enforce packaging and delivery contracts)

UAT validated: all 9 scheduled tasks executed and delivered to Teams channel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
microsoft-github-policy-service bot added the labels Needs: Documentation 📝 and Needs: Review 👀, and removed the label Needs: Documentation 📝, on Apr 28, 2026
…wledge/Teams split, memory model fix

- Add docs-mslearn/toolkit/hubs/configure-sre.md (duo-reviewed against configure-ai.md and AGENTS.md)
- Add to TOC.yml after Configure AI in Copilot Studio
- Switch Bicep from Review to Autonomous mode
- Fix README to reflect Autonomous mode
- Add 3 knowledge docs: onboarding, Teams notification guide, known issues and workarounds
- Fix scheduled task persistence: use #remember for operational notes, not git commits
- Replace 'save to knowledge base' with #remember in all 9 task prompts
- Add Phase 0 (read knowledge before starting) to all 9 tasks
- Enforce knowledge/Teams split: financial results to Teams only, operational learnings to memory only
- Fix post-provision: srectl scheduledtask apply (idempotent) instead of create
- Add known-issues-and-workarounds.md section 0: correct persistence model

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
microsoft-github-policy-service bot added and removed the label Needs: Documentation 📝 on Apr 28, 2026
MSBrett requested a review from Copilot on April 28, 2026 at 19:22
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
microsoft-github-policy-service bot added and removed the label Needs: Documentation 📝 on Apr 28, 2026

Copilot AI left a comment

Pull request overview

Adds a new redistributable FinOps SRE Agent template under src/templates/sre-agent/, including deploy automation, Bicep infrastructure, Kusto tool catalog, scheduled tasks, knowledge docs, and template validation tests, plus an MSLearn doc entry for configuring the agent with FinOps hubs.

Changes:

  • Introduces 21 Kusto-backed ExtendedAgentTool definitions and wires them into multiple autonomous scheduled tasks.
  • Adds packaged deployment and post-provision automation (deploy.*, post-provision.*) plus full Bicep infra modules (agent, identity, monitoring, RBAC, optional ADX role).
  • Adds template tests and documentation updates (template README, catalog, UAT plan, and MSLearn article + TOC entry).

Reviewed changes

Copilot reviewed 66 out of 66 changed files in this pull request and generated 11 comments.

File Description
src/templates/sre-agent/tools/top-services-by-cost.yaml New Kusto tool for top services by effective cost.
src/templates/sre-agent/tools/top-resource-types-by-cost.yaml New Kusto tool for top resource types by cost and count.
src/templates/sre-agent/tools/top-resource-groups-by-cost.yaml New Kusto tool for top resource groups by effective cost.
src/templates/sre-agent/tools/top-other-transactions.yaml New Kusto tool to surface large non-usage/non-commitment purchases.
src/templates/sre-agent/tools/top-commitment-transactions.yaml New Kusto tool for commitment purchase transactions with helper columns.
src/templates/sre-agent/tools/service-price-benchmarking.yaml New Kusto tool to benchmark list/contracted/effective costs and savings.
src/templates/sre-agent/tools/savings-summary-report.yaml New Kusto tool to summarize savings and effective savings rate.
src/templates/sre-agent/tools/reservation-recommendation-breakdown.yaml New Kusto tool to analyze reservation recommendations and break-even.
src/templates/sre-agent/tools/quarterly-cost-by-resource-group.yaml New Kusto tool for quarterly-style RG cost rows.
src/templates/sre-agent/tools/monthly-cost-trend.yaml New Kusto tool for monthly billed/effective cost trends.
src/templates/sre-agent/tools/monthly-cost-change-percentage.yaml New Kusto tool for month-over-month cost change percentages.
src/templates/sre-agent/tools/costs-enriched-base.yaml New “base” enriched Costs() query for broader analytics.
src/templates/sre-agent/tools/cost-forecasting-model.yaml New Kusto time-series forecasting tool.
src/templates/sre-agent/tools/cost-by-region-trend.yaml New Kusto tool for regional effective-cost distribution.
src/templates/sre-agent/tools/cost-by-financial-hierarchy.yaml New Kusto tool for hierarchy showback with percent-of-total.
src/templates/sre-agent/tools/cost-anomaly-detection.yaml New Kusto anomaly detection time-series tool.
src/templates/sre-agent/tools/commitment-discount-utilization.yaml New Kusto tool for commitment utilization mix and percent-of-total.
src/templates/sre-agent/tools/ai-token-usage-breakdown.yaml New Kusto tool for Azure OpenAI token/cost breakdown and unit costs.
src/templates/sre-agent/tools/ai-model-cost-comparison.yaml New Kusto tool comparing cost per 1K tokens by model.
src/templates/sre-agent/tools/ai-daily-trend.yaml New Kusto tool for daily AI token/cost trends.
src/templates/sre-agent/tools/ai-cost-by-application.yaml New Kusto tool allocating AI spend by tags (app/team/env/cost center).
src/templates/sre-agent/tests/sprint1-artifacts.test.mjs Adds template validation tests for agents/skills/connectors/infra/scripts/tasks.
src/templates/sre-agent/sre-config/scheduled-tasks/ytd-report.yaml New autonomous YTD finance reporting scheduled task prompt.
src/templates/sre-agent/sre-config/scheduled-tasks/mom-report.yaml New autonomous MoM report scheduled task prompt.
src/templates/sre-agent/sre-config/scheduled-tasks/hubs-health-check.yaml New daily hubs health-check scheduled task prompt.
src/templates/sre-agent/sre-config/scheduled-tasks/cost-optimization.yaml New cost optimization scheduled task prompt (advisor/orphaned/commitments).
src/templates/sre-agent/sre-config/scheduled-tasks/capacity-weekly-supply-review.yaml New weekly capacity supply review scheduled task prompt.
src/templates/sre-agent/sre-config/scheduled-tasks/capacity-quarterly-strategy.yaml New quarterly capacity strategy scheduled task prompt.
src/templates/sre-agent/sre-config/scheduled-tasks/capacity-monthly-planning.yaml New monthly capacity planning scheduled task prompt.
src/templates/sre-agent/sre-config/scheduled-tasks/capacity-daily-monitor.yaml New daily capacity monitor scheduled task prompt.
src/templates/sre-agent/sre-config/scheduled-tasks/ai-workload-cost-analysis.yaml New monthly AI workload cost analysis scheduled task prompt.
src/templates/sre-agent/sre-config/knowledge/teams-notification-guide.md Adds guidance for Teams delivery via built-in PostTeamsMessage tool.
src/templates/sre-agent/sre-config/knowledge/onboarding-recommendations.md Adds onboarding guidance and connector setup constraints.
src/templates/sre-agent/sre-config/knowledge/known-issues-and-workarounds.md Adds UAT-derived known issues/workarounds and data persistence rules.
src/templates/sre-agent/sre-config/connectors/finops-hub-kusto.yaml Adds DataConnector YAML for FinOps Hub Kusto connector contract reference.
src/templates/sre-agent/sre-config/agents/ftk-hubs-agent.yaml Adds/updates hubs specialist agent definition.
src/templates/sre-agent/sre-config/agents/ftk-database-query.yaml Adds/updates KQL specialist agent with tool access list.
src/templates/sre-agent/sre-config/agents/finops-practitioner.yaml Adds/updates FinOps practitioner agent with tools and handoffs.
src/templates/sre-agent/sre-config/agents/chief-financial-officer.yaml Adds/updates CFO agent with tools for finance narratives.
src/templates/sre-agent/sre-config/agents/azure-capacity-manager.yaml Adds/updates capacity manager agent with quota/capacity guardrails.
src/templates/sre-agent/scripts/post-provision.sh Adds bash post-provision automation to apply repo artifacts via srectl.
src/templates/sre-agent/scripts/post-provision.ps1 Adds PowerShell post-provision automation to apply repo artifacts via srectl.
src/templates/sre-agent/scripts/deploy.sh Adds packaged bash deploy wrapper around azd environment workflow.
src/templates/sre-agent/scripts/deploy.ps1 Adds packaged PowerShell deploy wrapper around azd environment workflow.
src/templates/sre-agent/plan.md Adds UAT plan and status tracking for template completeness.
src/templates/sre-agent/infra/bicep/resources.bicep Adds RG-scoped orchestrator wiring identity/monitoring/agent modules.
src/templates/sre-agent/infra/bicep/modules/subscription-rbac.bicep Adds subscription-scope role assignments (Reader + Monitoring Contributor).
src/templates/sre-agent/infra/bicep/modules/sre-agent.bicep Adds SRE Agent resource deployment module + optional data connector resource.
src/templates/sre-agent/infra/bicep/modules/monitoring.bicep Adds Log Analytics + App Insights resources for agent telemetry.
src/templates/sre-agent/infra/bicep/modules/identity.bicep Adds UAMI creation module and outputs.
src/templates/sre-agent/infra/bicep/modules/adx-role.json Adds ARM JSON for ADX AllDatabasesViewer principal assignment (portal flow).
src/templates/sre-agent/infra/bicep/modules/adx-role.bicep Adds Bicep module to assign ADX AllDatabasesViewer.
src/templates/sre-agent/infra/bicep/main.parameters.json Adds azd parameter mapping for main.bicep.
src/templates/sre-agent/infra/bicep/main.json Checks in compiled ARM template output for subscription deployment.
src/templates/sre-agent/infra/bicep/main.bicep Adds subscription entrypoint: RG creation, modules, RBAC, optional ADX role.
src/templates/sre-agent/azure.yaml Adds azd template definition and postprovision hooks.
src/templates/sre-agent/README.md Adds template README with deploy instructions, verification, and connector guidance.
src/templates/sre-agent/CATALOG.md Adds scheduled task catalog and tool gap analysis.
src/templates/sre-agent/AGENTS.md Adds agent instruction metadata and example connection details.
docs-mslearn/toolkit/hubs/configure-sre.md Adds MSLearn article documenting deployment/config of SRE agent with hubs + Teams.
docs-mslearn/TOC.yml Adds TOC entry for the new “Configure an SRE agent” doc.
.gitmodules Adds azcapman submodule pointer for capacity-management lab content.

Comment thread src/templates/sre-agent/AGENTS.md Outdated
Comment thread src/templates/sre-agent/scripts/post-provision.sh Outdated
Comment thread src/templates/sre-agent/scripts/post-provision.ps1 Outdated
Comment thread src/templates/sre-agent/infra/bicep/modules/sre-agent.bicep Outdated
Comment thread src/templates/sre-agent/README.md Outdated
Comment thread src/templates/sre-agent/infra/bicep/main.json Outdated
- Remove real subscription/tenant IDs from AGENTS.md, replace with placeholders
- Fix sre-agent.bicep comment: Review → Autonomous
- Fix README architecture diagram: Review → Autonomous
- Regenerate main.json from Bicep (Autonomous mode)
- Remove PyYAML dependency from post-provision scripts (use grep/Select-String)
- Fix KQL isempty()/isnotempty() on integer fields → isnull()/isnotnull()
- Tighten #remember in MOM task to exclude financial figures
- Update ms.date on changelog to 04/28/2026

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

MSBrett commented Apr 28, 2026

@copilot re-review please — all 11 findings addressed in commit 9f5ba45. 8 fixed directly, 3 acknowledged as follow-ups (tool parameter specificity in MOM/YTD prompts, required:true metadata in costs-enriched-base).

Copilot AI commented Apr 28, 2026

@MSBrett I'm unable to start working on this because of repository rules that prevent me from pushing to the branch:

  • Changes must be made through a pull request due to branch protection rules

See the documentation for more details.

msbrett and others added 7 commits April 28, 2026 17:42
…custom zone-peers role

Sprint 3: 14 Python tools built via TDD gate pipeline (14/14 done).

Capacity tools (from azcapman scripts):
- vm-quota-usage: ARM REST compute usages with warning/critical thresholds
- zone-mapping: ARM REST checkZonePeers for cross-sub zone alignment
- capacity-reservation-groups: ARM REST CRG list with instanceView utilization
- sku-availability: ARM REST compute SKUs with restriction parsing
- data-freshness-check: Kusto REST querying Hub functions (Costs/Prices/Recommendations/Transactions)
- resource-graph-query: ARM REST Resource Graph for orphaned resource detection
- non-compute-quotas: ARM REST Storage/Network usages + Resource Graph counts
- spot-placement-scores: ARM REST spot placement scores API
- benefit-recommendations: Cost Management REST benefitRecommendations with pagination

Governance tools (from azcapman scripts):
- deploy-budget: ARM REST subscription budget creation with validation
- deploy-bulk-budgets: Resource Graph discovery + per-sub budget deployment
- deploy-anomaly-alert: Cost Management scheduled actions for anomaly detection
- deploy-bulk-anomaly-alerts: Resource Graph discovery + per-sub alert deployment
- suppress-advisor-recommendations: Advisor REST suppression with GUID-validated IDs

Architecture decisions:
- All tools use requests + ARM REST (not azure-mgmt-* SDKs — not pre-installed in sandbox)
- All tools use ManagedIdentityCredential(client_id) for UAMI auth with DefaultAzureCredential fallback
- Entry point is def main(**kwargs) per platform harness requirement
- Write tools include input validation (GUID, email, enum, positive amount)
- Custom role 'FinOps SRE Zone Peers Reader' added for checkZonePeers/action
- data-freshness-check queries Hub functions (Costs/Prices/Recommendations/Transactions) not .show tables
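
A minimal sketch of the `def main(**kwargs)` entry point with input validation for a write tool follows. It uses only the standard library and stops before any authentication or ARM REST call. The parameter names (`subscription_id`, `amount`, `contact_email`, `time_grain`), the allowed enum values, and the returned dictionary shape are assumptions for illustration, not the tools' actual contract.

```python
import math
import re

GUID_RE = re.compile(r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}")
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
TIME_GRAINS = {"Monthly", "Quarterly", "Annually"}  # hypothetical enum values


def validate(subscription_id, amount, contact_email, time_grain):
    """Return a list of validation errors; an empty list means the input is usable."""
    errors = []
    if not GUID_RE.fullmatch(str(subscription_id)):
        errors.append("subscription_id must be a GUID")
    try:
        value = float(amount)
        # Reject booleans, NaN, infinities, zero, and negatives.
        positive = not isinstance(amount, bool) and math.isfinite(value) and value > 0
    except (TypeError, ValueError):
        positive = False
    if not positive:
        errors.append("amount must be a positive number")
    if not EMAIL_RE.fullmatch(str(contact_email)):
        errors.append("contact_email must look like an email address")
    if str(time_grain) not in TIME_GRAINS:
        errors.append("time_grain must be one of " + ", ".join(sorted(TIME_GRAINS)))
    return errors


def main(**kwargs):
    """Entry point in the keyword-argument style the commit message describes."""
    errors = validate(
        kwargs.get("subscription_id"),
        kwargs.get("amount"),
        kwargs.get("contact_email"),
        kwargs.get("time_grain", "Monthly"),
    )
    if errors:
        return {"status": "rejected", "errors": errors}
    # A real tool would authenticate and call the ARM REST API here (omitted).
    return {"status": "validated", "errors": []}


print(main(subscription_id="not-a-guid", amount=-1,
           contact_email="nobody", time_grain="Hourly"))
```

Validating before any network call means a malformed request is rejected without touching the subscription.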

UAT: 6/7 tools pass on live agent. zone-mapping needs the custom role deployed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… version

- Removed zone-mapping: requires AvailabilityZonePeering feature + custom role
  that cannot be verified end-to-end in this session. Documented the feature
  registration prerequisite in README for future reintroduction.
- Removed spot-placement-scores: requires Microsoft.Compute/locations/
  placementScores/generate/action not included in Reader role.
- Fixed deploy-bulk-budgets, deploy-bulk-anomaly-alerts, suppress-advisor-
  recommendations: replaced azure.mgmt.resourcegraph SDK (not installed in
  sandbox) with ARM REST API for Resource Graph queries.
- Fixed spot-placement-scores API version from 2024-11-01 (nonexistent) to
  2024-06-01-preview before removal.
- Moved custom role creation from Bicep to post-provision.sh for management
  group scope flexibility.
- Added AvailabilityZonePeering feature registration to README prerequisites.
- Verified vm-quota-usage returns 215 real quotas on live agent with UAMI auth.
- All 12 remaining Python tools pass local testing (14/14 local, 12 deployed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ew capacity audits

T-4.1: Wired 12 Python tools into all 9 existing scheduled tasks. Each task
prompt now references the relevant Python tools by name and instructs the
agent to pass client_id for UAMI authentication. Replaced az CLI fallback
references with Python tool equivalents.

T-4.2: Created 9 new scheduled tasks from azcapman lab mapping:
- ComputeUtilizationTrend (weekly): VM quota utilization trends
- StoragePaasGrowthForecast (monthly): storage/PaaS growth projection
- SkuAvailabilityAudit (weekly): SKU restrictions across regions
- NonComputeQuotaAudit (weekly): storage/network quota health
- BudgetCoverageAudit (monthly): subscriptions missing budgets
- AlertCoverageAudit (monthly): subscriptions missing anomaly alerts
- MonitoringScopeValidation (weekly): Hub coverage vs active subscriptions
- BenefitRecommendationReview (weekly): reservation/savings plan recommendations
- AdvisorSuppressionReview (monthly): stale/expired Advisor suppressions

T-4.3: All 18 tasks deployed to live agent via srectl scheduledtask apply.
Old duplicates cleaned up. Converted all tasks to v1 YAML format (snake_case)
required by the srectl API.

T-4.4: README updated with final inventory — 33 tools (21 Kusto + 12 Python),
18 scheduled tasks.

T-4.5: post-provision.sh already uses directory glob iteration — no changes
needed for new tasks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…se and attempt git commits

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…emove set -e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-provision tests

- Added EnableCodeInterpreter and EnableMemory to experimentalSettings
- Updated tests: removed repo connector assertions, relaxed apply-yaml to accept tool/agent apply

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ation as portal step

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 117 out of 117 changed files in this pull request and generated 5 comments.

Comment on lines +70 to +72
experimentalSettings: {
  EnableWorkspaceTools: true
}

Copilot AI Apr 29, 2026

experimentalSettings is missing EnableV2AgentLoop: true. The repo test suite expects both EnableV2AgentLoop and EnableWorkspaceTools enabled for this template, and leaving it out will break TC-5.2 (and may prevent the newer agent loop behavior the template relies on). Add EnableV2AgentLoop: true alongside EnableWorkspaceTools: true.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +12
api_version: azuresre.ai/v1
kind: ScheduledTask
metadata:
  name: YTD
  owner:
  version:
  tags:
spec:
  name: YTD
  cron_expression: ''
  agent_prompt: >-
    # Fiscal Year-to-Date Analysis

Copilot AI Apr 29, 2026

These sre-config/scheduled-tasks/scheduledtasks/** YAMLs appear to be exported runtime/task-instance artifacts (BOM at start of file, blank owner/version/tags, cron_expression: '', plus fields like start_time, duration_hours, and a separate cron field). They don’t match the canonical scheduled task definitions in sre-config/scheduled-tasks/*.yaml and (since post-provision doesn’t recurse) won’t be applied anyway. Consider removing this exported scheduledtasks/ tree from the redistributable template, or converting it into a clearly separated example/export location so it can’t be mistaken for deployable config.

spec:
  name: YTD
  description: Fiscal year-to-date analysis with forecast through end of fiscal year
  cron_expression: 0 9 1 * *

Copilot AI Apr 29, 2026

This task is scheduled for 0 9 1 * *, which is the same cron as other heavyweight monthly tasks (e.g., CapacityMonthlyPlanning and AdvisorSuppressionReview). If these run concurrently they can compete for tool quotas, increase execution time, and spam the Teams channel at the same time. Consider staggering monthly tasks (different minutes/hours) to reduce concurrency and improve reliability.

spec:
  name: CapacityMonthlyPlanning
  description: Monthly capacity planning cycle — forecast demand, procurement pipeline, governance review
  cron_expression: 0 9 1 * *

Copilot AI Apr 29, 2026

This task is scheduled for 0 9 1 * *, which is the same cron as other monthly tasks (e.g., YTD and AdvisorSuppressionReview). Running multiple long tasks at the same time can cause tool throttling and longer runtimes. Consider staggering the cron schedules to reduce concurrency.

spec:
  name: AdvisorSuppressionReview
  description: Monthly review of active Advisor recommendation suppressions for stale or expired decisions
  cron_expression: 0 9 1 * *

Copilot AI Apr 29, 2026

This task is scheduled for 0 9 1 * *, which overlaps with other monthly tasks (e.g., CapacityMonthlyPlanning and YTD). Consider staggering the cron schedule to avoid multiple monthly reports running concurrently and posting to Teams at the same time.

msbrett and others added 4 commits April 29, 2026 10:15
Add --dry-run flag to post-provision.sh and -DryRun switch to
post-provision.ps1. When enabled, scripts log every srectl/az command
that would execute without actually calling them. Output uses
[DRY-RUN] prefix per category (skill, agent, tool, knowledge,
scheduled task).

Additional fixes from red-team audit:
- Remove Add-RepoConnector from PS1 (parity with .sh safety removal;
  agent was searching full codebase and attempting git commits)
- Unknown arguments now fail closed in .sh instead of being ignored

Includes TDD test suite:
- tests/test-dry-run.sh (shell contract test)
- tests/Test-DryRun.Tests.ps1 (Pester 5 contract test, 5 cases)
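
The dry-run and fail-closed behavior described above can be sketched in Python as follows. This illustrates the pattern only and is not the contents of post-provision.sh or post-provision.ps1. The `build_parser` and `run` helpers, the exact log format after the `[DRY-RUN]` prefix, and the placeholder `echo` command are all assumptions.

```python
import argparse
import shlex
import subprocess


def build_parser():
    """Build a parser for a hypothetical post-provision driver.

    argparse rejects unknown arguments by default (it exits with status 2),
    which gives the fail-closed behavior without extra code.
    """
    parser = argparse.ArgumentParser(description="Hypothetical post-provision driver")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="log every command that would run instead of executing it",
    )
    return parser


def run(category, command, dry_run, log=print):
    """Execute `command`, or only log it when `dry_run` is set.

    `category` is a label such as "skill" or "tool"; `command` is an argv list.
    """
    if dry_run:
        log(f"[DRY-RUN] {category}: {shlex.join(command)}")
        return None
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Placeholder command; a real script would pass its own srectl/az argv here.
    run("skill", ["echo", "placeholder command"], args.dry_run)
```

Routing every external call through one wrapper keeps the dry-run guarantee in a single place, so a contract test only needs to assert that nothing is executed when the flag is set.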

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Create new top-level docs section at docs-mslearn/toolkit/sre-agent/
with 9 pages covering the full FinOps SRE Agent feature surface:

- overview.md: what it is, architecture, cost estimate
- deploy.md: prerequisites, azd up, notifications, dry-run, regions
- agents.md: 5 subagents, handoff model, 3 skills
- tools.md: 33 tools (21 Kusto + 12 Python) grouped by domain
- scheduled-tasks.md: 18 deployed tasks by cadence
- knowledge.md: knowledge docs, memory system, persistence
- security.md: RBAC, managed identity, least privilege, run modes
- troubleshooting.md: B2B tenants, known issues, workarounds
- template.md: Bicep params, azd outputs, script flags

Also:
- Add FinOps SRE Agent section to TOC after FinOps hubs
- Remove old Configure SRE agent entry from hubs TOC
- Add redirect from hubs/configure-sre to sre-agent/overview
- Update cross-links in changelog and other docs
- Set ms.reviewer to brettwil (micflan no longer at MS)

Audited against MS Learn metadata requirements and
partner-center-pr/finops/docfx.json globalMetadata contract.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…udio

Add Azure SRE Agent and Azure Copilot Studio as AI integrations
alongside the existing GitHub Copilot in the FinOps hubs architecture
diagram. All three query the ADX / Fabric normalized data layer.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	docs-mslearn/toolkit/changelog.md
msbrett and others added 30 commits May 27, 2026 20:43
Adds src/queries/catalog/percentage-untagged-costs.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.1. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/percentage-unallocated-costs.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.2. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/allocation-accuracy-index.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.3. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/tagging-policy-compliance.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.4. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/anomaly-detection-rate.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.5. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/anomaly-variance-total.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.6. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/cost-visibility-delay.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.7. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/data-update-frequency.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.8. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/commitment-utilization-score.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.9. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/commitment-discount-waste.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.10. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…counts query (T-3000.11)

Adds src/queries/catalog/compute-spend-commitment-coverage.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.11. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
….12)

Adds src/queries/catalog/compute-cost-per-core.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.12. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/cost-optimization-index.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.13. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/cost-per-gb-stored.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.14. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/macc-consumption-vs-commitment.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.15. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds src/queries/catalog/storage-tier-distribution.kql — KQL implementation for the
FinOps Foundation KPI tracked by sprint task T-3000.16. Reviewed
by Sonnet 4.6 red-team (Wave 4) and advanced through Gate.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds INDEX.md scenario rows and KPI.md mappings for the 16 new
Hub KQL queries delivered in Sprint 3000. KPI.md is regenerated
from the FinOps Foundation Algolia KPI index with explicit
toolkit-query mappings only where a query directly computes the
published formula.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, .13, .15)

Sprint 3000 wave-boundary reviews (gpt-5.5 + claude-opus-4.6) surfaced
three correctness findings on the 16 newly-landed Hub KQL queries.
This commit applies the targeted fixes; both reviewers re-verified RESOLVED.

T-3000.12 compute-cost-per-core.kql:
  gpt-5.5 flagged that UnusedCommitmentCost previously summed ALL unused
  commitment rows, including non-compute (storage/database/etc.) commitment
  waste, into the compute cost-per-core numerator. Added tmp_IsComputeCommitment
  predicate scoping the sumif to x_SkuMeterCategory in
  ('Virtual Machines', 'Virtual Machine Licenses').

T-3000.13 cost-optimization-index.kql:
  gpt-5.5 flagged that grouping by BillingCurrency repeated the single global
  Recommendations() savings scalar across every currency, producing misleading
  per-currency COIN scores for multi-currency hubs. Recommendations() has no
  BillingCurrency column (schema guide lines 591-606). Dropped the
  by BillingCurrency clause; output is now a single hub-wide row with
  ReportingCurrencyScope = 'HubWide'. Operators of multi-currency hubs are
  advised in Scope Notes to normalize Costs() to a single currency first.

T-3000.15 macc-consumption-vs-commitment.kql:
  opus-4.6 flagged that the output substituted BillingAccountId for the
  spec-required BillingProfileId. gpt-5.5 flagged that naive broadcasting
  of the account-level CommitmentAmount across profile rows would duplicate
  the same commitment value. Fix: collect AccountCommitment from Transactions()
  (which lacks x_BillingProfileId) at the account level, then allocate
  proportionally across profiles by ProfileShare = ProfileConsumption /
  AccountConsumption. Per-profile CommitmentAmount values now sum exactly
  to AccountCommitment. Output projects BillingProfileId per spec.
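
The proportional allocation can be illustrated with a small Python sketch. The formula ProfileShare = ProfileConsumption / AccountConsumption comes from the message above. The function name, the sample figures, and the zero-consumption fallback are assumptions and are not taken from the query.

```python
def allocate_commitment(account_commitment, profile_consumption):
    """Split an account-level commitment across profiles by consumption share.

    profile_consumption maps a billing profile ID to its consumption.
    Assumption: when the account has no consumption, every profile gets 0.0.
    """
    account_consumption = sum(profile_consumption.values())
    if account_consumption == 0:
        return {profile: 0.0 for profile in profile_consumption}
    return {
        profile: account_commitment * consumption / account_consumption
        for profile, consumption in profile_consumption.items()
    }


# Illustrative figures only: one account commitment, two billing profiles.
allocation = allocate_commitment(1000.0, {"profile-a": 300.0, "profile-b": 100.0})
for profile, amount in allocation.items():
    print(profile, amount)
print("total:", sum(allocation.values()))
```

With floating-point inputs the per-profile amounts sum to the account commitment up to rounding error, so a check on the total should allow a small tolerance.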

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Live-hub UAT against msbwftktreyhub.westus.kusto.windows.net/Hub surfaced
a Kusto SemanticException SEM0019 in percentage-unallocated-costs.kql:

    Call to iff(): @then data type (real) must match the @else data type (decimal)

On line 38 the unallocated-cost summarize used decimal(0) as the iff() else
literal while EffectiveCost is typed real in the live Hub schema. Branch types
must match; replaced decimal(0) with real(0). Re-UAT returned 1 row with the
expected 4-column schema (BillingCurrency, UnallocatedEffectiveCost,
TotalEffectiveCost, UnallocatedPercent). No other changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Live-hub UAT against msbwftktreyhub.westus.kusto.windows.net/Hub surfaced the
same Kusto SemanticException SEM0019 in commitment-discount-waste.kql:

    Call to iff(): @then data type (real) must match the @else data type (decimal)

Four occurrences: two iff(..., EffectiveCost, decimal(0)) inside the
commitmentWaste and grandTotals summarize blocks, and two
iff(TotalCost == decimal(0), 0.0, ...) scalar comparisons that mixed
decimal with double. Replaced decimal(0) with real(0) inside the summarize
and with 0.0 in the scalar comparisons.

Re-UAT returned ZERO_ROWS — the query now parses and executes cleanly with
the spec-required 7-column schema, but the trey-hub test fixture holds no
records matching isnotempty(CommitmentDiscountId) in the 30-day window
(same data-availability gap as T-3000.9 commitment-utilization-score). The
numeric path remains to be re-UAT'd on a commitment-active hub.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
UAT executed on 2026-05-28 against msbwftktreyhub.westus.kusto.windows.net/Hub
(sentinel: 1,366,763 cost rows in 2026-04 window, USD only).

Updated 'Last Tested:' header on all 16 Sprint 3000 catalog queries with
the date, target cluster, sentinel row count, and per-query outcome.

UAT outcome summary:
  - 13 PASS (T-3000.1, 2, 3, 4, 5, 6, 8, 11, 12, 13, 14, 15, 16)
  - 2 ZERO_ROWS with schema match (T-3000.9, T-3000.10) — data-availability
    gap; the trey-hub fixture holds no commitment-discount records in the
    30-day window. Queries themselves are spec-correct; numeric path
    requires re-UAT on a commitment-active hub.
  - 1 PASS_WITH_NOTE (T-3000.7) — query returns the 7 spec-required columns
    plus 4 documented *DelayDays extras already declared in the file's own
    Output: header (P50/P90/P99/MaxDelayDays). Internally consistent; extras
    are intentional operational utility, not a defect.

SEM0019 decimal/real fixes were committed separately for T-3000.2 and
T-3000.10 ahead of this metadata update.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-3000.13)

Live-hub UAT against msbwftktreyhub.westus.kusto.windows.net/Hub surfaced a
silent data-loss bug in cost-optimization-index.kql.

Root cause: the query filtered Recommendations() by
`x_RecommendationDate >= startDate and x_RecommendationDate < endDate`,
but the schema reality is that x_RecommendationDate is null on every
reservation-recommendation row in the live Hub (verified empirically:
17/17 rows null in trey-hub). The KQL null-vs-datetime comparison evaluates
false, so the predicate silently dropped all 17 ReservationRecommendations
rows holding $18,976.30 of savings opportunity. Net effect:
TotalSavingsOpportunity=0 → COIN=100 (false-positive perfect score).

Fix: remove the x_RecommendationDate window filter. Schema analysis
(gpt-5.5 ftk-database-query against trey-hub) confirmed Recommendations()
has no reliable business time grain — x_IngestionTime is ingestion-only,
x_RecommendationDate is documented but commonly null. Reservation
recommendations are point-in-time optimization snapshots, not dated events,
so aggregating the current recommendation set against the windowed cost
amortization is the canonically-correct shape. Cost denominator remains
windowed; only the savings numerator is now time-grain-naive (matching the
data reality).

Re-UAT result (live trey-hub, file as committed):
  ReportingCurrencyScope=HubWide
  TotalSavingsOpportunity=18976.29672832971
  TotalCost=810042.1899998859
  COINScore=97.65736933673388
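
The failure mode can be reproduced outside Kusto with a short Python sketch. It makes three assumptions. The score formula 100 * (1 - savings / cost) is inferred from the figures reported above and is not quoted from the query. The equal split of savings across the 17 rows is illustrative, because per-row values are not given. The window dates are placeholders.

```python
from datetime import datetime

TOTAL_SAVINGS = 18976.29672832971  # reported TotalSavingsOpportunity
TOTAL_COST = 810042.1899998859     # reported TotalCost


def kql_in_window(value, start, end):
    """Mimic a KQL `where` on a nullable datetime: a null value never matches."""
    if value is None:
        return False
    return start <= value < end


def coin_score(total_savings, total_cost):
    """Assumed formula, inferred from the reported figures."""
    return 100.0 * (1.0 - total_savings / total_cost)


# 17 recommendation rows, all with a null date (equal split is illustrative).
rows = [{"date": None, "savings": TOTAL_SAVINGS / 17} for _ in range(17)]

# Placeholder window; the real query derives startDate and endDate itself.
start, end = datetime(2026, 4, 1), datetime(2026, 5, 1)

filtered_savings = sum(
    row["savings"] for row in rows if kql_in_window(row["date"], start, end)
)
unfiltered_savings = sum(row["savings"] for row in rows)

print("with date filter:   ", coin_score(filtered_savings, TOTAL_COST))
print("without date filter:", coin_score(unfiltered_savings, TOTAL_COST))
```

With the date filter every row is dropped, so savings total zero and the score is a perfect 100. Without it, the score lands close to the reported COINScore.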

Header + Scope Notes updated to document the time-grain decision and the
empirical x_RecommendationDate null finding.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…covery)

Discovered during Sprint 3000 UAT against live trey-hub
(msbwftktreyhub.westus.kusto.windows.net) — schema-guide drift caused
the T-3000.13 time-grain bug and the SEM0019 class. Three corrections,
all empirically verified against the live Hub via `getschema` probes.

1. Numeric column types (`Costs()` 26 cols, `Prices()` 13 cols,
   `Recommendations()` 3 cols): `decimal` -> `real` to match deployed
   Hub schema. Per KQL, `real` is the 64-bit floating-point type
   (.NET System.Double); the live Hub exposes all FOCUS numeric columns
   as `real`, not the 128-bit fixed-precision `decimal` previously
   declared in this guide. This mismatch was the root cause of the
   SEM0019 errors fixed earlier this sprint in T-3000.2 and T-3000.10
   (the iff(...) false-branch needs `real(0)`, not `decimal(0)`).

2. `Recommendations()` columns: expanded from 12 to 20 to add the 8
   columns present in the live schema but missing from the guide —
   ResourceId, ResourceName, ResourceType, SubAccountName,
   x_RecommendationCategory, x_RecommendationDescription,
   x_RecommendationId, x_ResourceGroupName.

3. `x_RecommendationDate` documented as commonly-null in live Hubs.
   This was the root cause of T-3000.13: a date-window predicate
   silently dropped all 17 reservation-recommendation rows because
   KQL `null >= datetime(...)` evaluates false. Guide now warns
   against the pattern and points authors to `x_IngestionTime` as
   the time anchor.

Added 'real vs decimal' callout near the Table reference header so
future authors don't repeat the SEM0019 mistake. Added a
'sparsely-populated columns' callout on `Recommendations()` so
authors verify column population with `isnotempty()` before relying
on a column as a join key or filter.

Scope held tight per user direction (this sprint = data available
today via the Hub db): `Transactions()` table NOT modified because
the live trey-hub has zero `Transactions()` rows — types cannot be
empirically verified, so no claim is changed. MACC commitment-balance
binding NOT added for the same reason.

Source: live-Hub `getschema` probes + 3-row Recommendations() sample
via uat-sprint-3000 agent on 2026-05-28. Change log entry recorded
as v1.2.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Labels

Needs: Review 👀 PR that is ready to be reviewed

3 participants