[prompt-clustering] Copilot Agent Prompt Clustering Analysis — 2026-05-13 #31919

2026-05-13T10:58:26Z

github-actions[bot]
Bot May 13, 2026

Summary

Analysis Period: 2026-04-24 → 2026-05-13 (~19 days)
Total Copilot PRs Analyzed: 1000
Overall Merge Rate: 79.5% (795 merged / 205 closed-without-merge)
Clusters Identified: 7 (k-means on TF-IDF, optimal silhouette in 3–10 range)
Data Sources: PR bodies + titles from /tmp/gh-aw/prompt-cache/pr-full-data/

Key Findings

Two clusters dominate: Workflow Compilation (30.4%) and Testing & Test Coverage (29.7%) account for ~60% of all Copilot tasks, reflecting the project's heavy reliance on the agent for lock-file recompilation and Go-package test fixes.
AI Engine Integration is the weakest cluster at 62.5% merge rate — and the most expensive (avg 93 files changed, +1020/-553 lines, 6.0 comments/PR). These are big AWF version bumps, retries, firewall steering changes — high blast radius, more revisions.
Caching is the strongest cluster at 88.1% merge rate with the smallest line footprint (+138/-44 lines across 10.5 files on average). Narrow, contained changes succeed.
Merge rate inversely correlates with PR size. Clusters averaging fewer than 20 files changed all sit at ≥84% merge rate; the three clusters averaging more than 30 files fall to 62.5–73.9%.
Comment volume is a stress signal: AI Engine Integration averages 6.0 comments/PR vs ~2.6 elsewhere — these tasks need more back-and-forth.

Cluster Overview

Cluster	Tasks	% of PRs	Merge Rate	Avg Files Changed	Avg Lines +/-	Top Keywords
Workflow Compilation	304	30.4%	73.7%	30.4	+432/-239	md, daily, github, workflows, agent
Testing & Test Coverage	297	29.7%	86.2%	18.3	+557/-502	pkg, string, error, engine, pkg workflow
Safe Outputs	128	12.8%	85.9%	15.2	+308/-94	safe, pull, safe outputs, pull request, outputs
MCP Server / Protocol	92	9.2%	73.9%	48.2	+484/-326	mcp, cli, tool, server, tools
AI Engine Integration	72	7.2%	62.5%	93.0	+1020/-553	awf, v0, v0.25, awf config, config
Experiments / Frontmatter	65	6.5%	84.6%	9.4	+298/-54	experiment, experiments, bin, git, variant
Caching	42	4.2%	88.1%	10.5	+138/-44	cache, memory, cache memory, miss, state
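The size-versus-merge-rate relationship can be checked directly against the table above. The following is a minimal Python sketch, not part of the original analysis: the rows are transcribed by hand from the table, and the names `CLUSTERS` and `pearson` are chosen here for illustration. It recomputes the overall merge rate from the per-cluster figures and the correlation between average files changed and merge rate.

```python
from math import sqrt

# (cluster, tasks, merge rate %, avg files changed), transcribed from the table above
CLUSTERS = [
    ("Workflow Compilation", 304, 73.7, 30.4),
    ("Testing & Test Coverage", 297, 86.2, 18.3),
    ("Safe Outputs", 128, 85.9, 15.2),
    ("MCP Server / Protocol", 92, 73.9, 48.2),
    ("AI Engine Integration", 72, 62.5, 93.0),
    ("Experiments / Frontmatter", 65, 84.6, 9.4),
    ("Caching", 42, 88.1, 10.5),
]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


total_tasks = sum(tasks for _, tasks, _, _ in CLUSTERS)
# Recover per-cluster merged counts from the rounded percentages.
merged = sum(round(tasks * rate / 100) for _, tasks, rate, _ in CLUSTERS)
overall_rate = 100 * merged / total_tasks

size_vs_merge = pearson(
    [files for _, _, _, files in CLUSTERS],
    [rate for _, _, rate, _ in CLUSTERS],
)

print(f"{merged}/{total_tasks} merged ({overall_rate:.1f}%)")
print(f"Pearson r, avg files changed vs merge rate: {size_vs_merge:.2f}")
```

The per-cluster rates sum back to the 795 merged PRs out of 1000 reported in the Summary. The correlation is computed over only seven cluster averages, so it should be read as a rough indicator rather than a per-PR effect size.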

Cluster details & representative PRs

Cluster 6 — Workflow Compilation (304 PRs, 73.7%)

Compilation/regeneration of agentic-workflow lock files, daily-workflow scaffolding, shared imports (e.g. observability-otlp.md), model alias updates. Highest volume, mid-low merge rate because lock-file regenerations are routinely superseded by newer commits before merge.

#30995 Import shared/observability-otlp.md in most agentic workflows (merged, 390 files)
#29399 Replace deprecated {{#import}} with {{#runtime-import}} (merged)
#29668 Add inline sub-agent syntax using ## agent: (merged)

Cluster 3 — Testing & Test Coverage (297 PRs, 86.2%)

Go-package fixes (pkg/workflow, engine, parsers), lint failures, error-string changes, CJS handler updates. Highest-volume "contained" cluster — clear scope, mostly merged.

#31484 Set frontmatter defaults and add shared import/expression (merged)
#31764 Add concurrency.queue support (merged)
#29031 Fix: detection job never fails in warn mode due to parse/inference (merged)

Cluster 0 — Safe Outputs (128 PRs, 85.9%)

Work on the safe-output surface: create_issue, create_pull_request, push-to-pull-request-branch, label-triggered jobs, permission gating. Narrow scope, high merge rate.

#31820 Use aw_context fallbacks for injected GitHub prompt context (merged)
#29270 Auto-inject create-issue safe output (merged)
#29269 Label-triggered jobs (merged)

Cluster 5 — MCP Server / Protocol (92 PRs, 73.9%)

MCP CLI mounting, gateway hardening, playwright integration, timeout/retry semantics. Large footprint (48 files avg) — refactors that touch many handler tables.

#28842 Remove features.mcp-cli flag (merged, 344 files)
#28448 Deprecate features.mcp-cli (closed, 369 files)
#29418 Convert 55.8% of agentic workflows to use tools.github mode (merged)

Cluster 2 — AI Engine Integration (72 PRs, 62.5%) ⚠️

Lowest merge rate. AWF version bumps (v0.25.x), Pi engine, OTel attribute capture, firewall token steering, retries on health-probe failures. High file count (93 avg) and comment volume (6.0/PR) — these PRs need iteration.

#31796 Bump AWF to v0.25.44 + firewall.effective-token-steering (merged)
#31789 Bump AWF + token steering/model muting (closed — superseded by Bump AWF to v0.25.44 and add firewall.effective-token-steering compiler support #31796)
#29789 Add Pi agentic engine (experimental) (merged)
#28956 Retry AWF invocation on intermittent awf-api-proxy (closed)

Cluster 1 — Experiments / Frontmatter (65 PRs, 84.6%)

A/B experiment frontmatter, variant scaffolding, shared experiments/ modules. Small, focused diffs.

#29534 Extend frontmatter with A/B experiments section (merged)
#29606 Add daily-experiment-report workflow (merged)
#29996 Add storage option to experiments (cache | repo) (merged)

Cluster 4 — Caching (42 PRs, 88.1%) ✅

Highest merge rate. Cache-memory miss detection, repo-memory sanitization, push gating. Smallest line footprint (+138/-44 avg across ~10 files), shortest review cycle (1.1 comments/PR).

#28516 Cache-memory cache_memory_miss detection & concurrency (merged)
#28434 Daily-cache-strategy-analyzer workflow (merged)
#28894 Gate push_repo_memory on agent success (merged)

Representative PR data table (top by files-changed per cluster)

PR	Cluster	Outcome	Files Changed	Lines +/-
#31820	Safe Outputs	Merged	232	+5670/-5424
#31835	Safe Outputs	Merged	229	+715/-579
#29958	Safe Outputs	Merged	211	+475/-475
#31065	Safe Outputs	Closed	186	+621/-621
#31001	Experiments / Frontmatter	Closed	187	+809/-634
#28814	Experiments / Frontmatter	Merged	44	+1068/-415
#29397	Experiments / Frontmatter	Closed	38	+319/-300
#29534	Experiments / Frontmatter	Merged	29	+1700/-36
#31796	AI Engine Integration	Merged	274	+3385/-3069
#31808	AI Engine Integration	Merged	239	+710/-69
#31789	AI Engine Integration	Closed	237	+2921/-2595
#29789	AI Engine Integration	Merged	235	+2554/-1040
#31484	Testing & Test Coverage	Merged	236	+899/-722
#31764	Testing & Test Coverage	Merged	231	+412/-55
#30414	Testing & Test Coverage	Merged	226	+280/-281
#31203	Testing & Test Coverage	Closed	226	+865/-765
#28516	Caching	Merged	94	+272/-36
#30499	Caching	Closed	93	+205/-116
#28777	Caching	Closed	39	+667/-4
#28894	Caching	Merged	38	+95/-51
#28448	MCP Server / Protocol	Closed	369	+3171/-2714
#28842	MCP Server / Protocol	Merged	344	+5268/-3103
#31136	MCP Server / Protocol	Closed	247	+1101/-823
#31706	MCP Server / Protocol	Closed	223	+348/-60
#30995	Workflow Compilation	Merged	390	+7821/-3265
#29399	Workflow Compilation	Merged	287	+386/-568
#29668	Workflow Compilation	Merged	243	+2001/-144
#30739	Workflow Compilation	Merged	225	+328/-300

Methodology & caveats

Input: 1000 most-recent Copilot-authored PRs (full PR JSON dump in /tmp/gh-aw/prompt-cache/pr-full-data/).
Prompt signal: PR title + first ~2000 chars of PR body (stripped of code fences, firewall warning blocks, URLs, markdown formatting). The PR body is the agent's description of work done, not the literal original prompt — the analysis therefore characterizes the distribution of tasks the agent ends up working on, which is the actionable view.
Vectorizer: TF-IDF, 1-2 grams, max_features=800, min_df=3, max_df=0.6, English stop-words plus a domain stop-list (pr, copilot, gh, aw, fix, feat, chore, refactor, test, docs, wip).
Clustering: K-means with n_init=20. Swept k=3..10; silhouette scores were tight (0.015–0.032). Picked k=7 as the best score that still produced human-interpretable labels.
Cluster labels: heuristic mapping from top TF-IDF terms to a topic name (rule table in /tmp/gh-aw/agent/cluster.py). Verified by manually inspecting the top-3 PRs of each cluster.
Workflow turn metrics: not joined in this run — log enrichment from gh-aw logs --engine copilot is left for a follow-up pass. Comment count is used as a noisy proxy for iteration count.
Low silhouette score (0.032) reflects that the prompt space is genuinely continuous — these are all gh-aw tasks with overlapping vocabulary. The clusters are still useful for prioritization but should be read as soft groupings, not hard partitions.
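The pipeline described above can be sketched roughly as follows. This is an illustrative reconstruction, not the contents of /tmp/gh-aw/agent/cluster.py: it assumes scikit-learn (the parameter names above match its `TfidfVectorizer` and `KMeans`), the function names `prompt_signal` and `sweep_k` are hypothetical, the regexes are a guess at what "stripped of code fences, URLs, markdown formatting" means, firewall warning blocks are not handled because their format is not given, and truncation is assumed to happen after stripping.

```python
import re

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics import silhouette_score

# Domain stop-list as given in the methodology.
DOMAIN_STOP_WORDS = {
    "pr", "copilot", "gh", "aw", "fix", "feat",
    "chore", "refactor", "test", "docs", "wip",
}

FENCE_RE = re.compile(r"```.*?```", re.DOTALL)  # fenced code blocks
URL_RE = re.compile(r"https?://\S+")            # bare URLs
MARKUP_RE = re.compile(r"[#*`>\[\]()|]")        # common markdown punctuation


def prompt_signal(title, body, max_body_chars=2000):
    """Build the text for one PR: title plus the cleaned, truncated body."""
    text = FENCE_RE.sub(" ", body)
    text = URL_RE.sub(" ", text)
    text = MARKUP_RE.sub(" ", text)
    text = " ".join(text.split())  # collapse whitespace
    return f"{title} {text[:max_body_chars]}".strip()


def sweep_k(texts, k_values=range(3, 11), min_df=3, max_df=0.6,
            max_features=800, seed=0):
    """Return {k: silhouette score} for k-means on TF-IDF features."""
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=max_features,
        min_df=min_df,
        max_df=max_df,
        stop_words=list(ENGLISH_STOP_WORDS | DOMAIN_STOP_WORDS),
    )
    features = vectorizer.fit_transform(texts)
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=20, random_state=seed)
        labels = model.fit_predict(features)
        scores[k] = float(silhouette_score(features, labels))
    return scores
```

With the defaults, `sweep_k(texts)` covers k=3..10 as in the run above; choosing the final k (here 7) remains a manual step, since the report weighs label interpretability alongside the score. On a small corpus `min_df` and `max_df` need loosening (for example `min_df=1, max_df=1.0`), otherwise the vocabulary can end up empty.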

Recommendations

Investigate the AWF/AI Engine Integration cluster (62.5% merge rate): it is the only cluster below 70%, more than 11 points under the next-lowest (Workflow Compilation at 73.7%). Multiple PRs are superseded mid-flight (e.g. Bump AWF to v0.25.44 and expose token steering/model multipliers in compiler frontmatter #31789 → Bump AWF to v0.25.44 and add firewall.effective-token-steering compiler support #31796). Consider whether the prompt for AWF version bumps should pin to one open PR per version, or whether the workflow should self-detect that an open Copilot PR for the same bump already exists.
Workflow Compilation churn: 304 PRs (30% of all Copilot work) fall in this cluster, largely lock-file regenerations. About 80 of them (26.3%) closed unmerged, often because newer source changes made them obsolete. Consider squashing the regenerator into a daily-batch workflow rather than per-source-change PRs, or auto-closing superseded regenerations.
Comment volume signal — AI Engine Integration averages 2.3× the comments/PR of other clusters. This is a leading indicator of "prompt needs more upfront context." Adding a checklist (release notes link, breaking-change scan, integration-test snapshot) to the AWF-bump prompt template would likely cut review cycles.
The Caching cluster is a model template — small (10 files), clear scope, 88% merge. Patterns in those prompts (narrow file scope, single concern per PR, explicit success criteria) should be lifted into prompt guidance for the noisier clusters.
Re-run with workflow turn metrics — joining aw_info.json turn counts per PR would let us replace the comment-count proxy with a real iteration count and identify which clusters burn the most agent turns.
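The "self-detect an existing bump PR" idea from the first recommendation could look something like the sketch below. It is hypothetical throughout: the function name `find_open_bumps`, the input shape (a mapping of open PR number to title), and the title-matching rule are assumptions made here, and how the list of open PRs is fetched is left out of scope.

```python
import re

# Matches version strings such as v0.25.44.
VERSION_RE = re.compile(r"\bv\d+\.\d+\.\d+\b")
BUMP_RE = re.compile(r"\bbump awf\b", re.IGNORECASE)


def find_open_bumps(open_prs, target_version):
    """Return numbers of open PRs whose title already bumps AWF to target_version.

    open_prs: dict mapping PR number -> PR title (hypothetical input shape).
    """
    matches = []
    for number, title in sorted(open_prs.items()):
        if BUMP_RE.search(title) and target_version in VERSION_RE.findall(title):
            matches.append(number)
    return matches


# Usage: titles taken from the superseded pair cited above.
open_prs = {
    31789: "Bump AWF to v0.25.44 and expose token steering/model multipliers "
           "in compiler frontmatter",
    29789: "Add Pi agentic engine (experimental)",
}
existing = find_open_bumps(open_prs, "v0.25.44")
if existing:
    print(f"Skip new PR; update existing bump PR(s): {existing}")
```

Matching on title text is deliberately crude. A real check would probably compare the version actually pinned in the diff, but a title match would already have flagged #31789 before #31796 was opened.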

References:

Workflow run §25793566521

Generated by Copilot Agent Prompt Clustering Analysis · ● 11.2M · ◷

expires on May 14, 2026, 10:58 AM UTC

2026-05-14T11:21:24Z

github-actions[bot]
Bot May 14, 2026
Author

This discussion was automatically closed because it expired on 2026-05-14T10:58:26.347Z.

Closed by Workflow

0 replies
