feat: add experimental BinEval evaluation support by Copilot · Pull Request #42100 · github/gh-aw

Copilot · 2026-06-28T14:47:18Z

Adds native BinEval-style evaluations to gh-aw — small, binary questions declared in workflow frontmatter, executed post-run via an LLM harness, with results aggregated and reported as CI artifacts.

Schema (`evals` frontmatter)

New optional evals array with id + question fields; validated for unique IDs and non-empty questions
Emits experimental warning at compile time

evals:
  - id: builds
    question: Does the generated code compile?
  - id: focused
    question: Is the implementation limited to the requested change?
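The two validation rules listed for the schema (unique IDs, non-empty questions) can be illustrated with a minimal sketch. This is not the compiler's code, which lives on the Go side; `validateEvals` and its error strings are hypothetical names chosen for illustration, and the sketch assumes the frontmatter has already been parsed into an array of plain objects.

```javascript
// Hypothetical validator for a parsed `evals` array (illustration only).
// Returns a list of human-readable error strings; an empty list means valid.
function validateEvals(evals) {
  const errors = [];
  const seenIds = new Set();

  evals.forEach((entry, index) => {
    const item = entry ?? {};
    const id = typeof item.id === "string" ? item.id.trim() : "";
    const question = typeof item.question === "string" ? item.question.trim() : "";

    if (!id) {
      errors.push(`evals[${index}]: id is required`);
    } else if (seenIds.has(id)) {
      errors.push(`evals[${index}]: duplicate id "${id}"`);
    } else {
      seenIds.add(id);
    }

    if (!question) {
      errors.push(`evals[${index}]: question must not be empty`);
    }
  });

  return errors;
}
```

Collecting every error instead of stopping at the first one is a design choice of this sketch; it lets a workflow author fix all schema problems in one pass.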

Evaluation model

EvalDefinition, EvalResult, EvalSummary types in frontmatter_types.go
WorkflowData.Evals []EvalDefinition for downstream consumers

Eval job

New eval job injected after agent + detection jobs in the compiled workflow
JS harness (eval_harness.cjs) calls GitHub Models API (gpt-4o-mini) per question independently — no MCPs, no checkout
Prompt generation produces per-question binary prompts with rationale; no holistic scoring
Results aggregated (total/passed/failed/pass-rate) and uploaded as an eval artifact with a markdown step summary
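As a rough illustration of the aggregation step, the sketch below computes the four summary fields named above and renders a small markdown summary. The result shape (`id`, `passed`, `rationale`) and the function names `summarizeResults` and `toMarkdownSummary` are assumptions made for this example; they are not taken from the harness itself.

```javascript
// Hypothetical aggregation over per-question results (illustration only).
// Each result is assumed to look like { id, passed, rationale }.
function summarizeResults(results) {
  const total = results.length;
  const passed = results.filter(r => r.passed === true).length;
  const failed = total - passed;
  // Avoid dividing by zero when no evals were declared.
  const passRate = total === 0 ? 0 : passed / total;
  return { total, passed, failed, passRate };
}

// Renders the summary plus one table row per question as markdown.
function toMarkdownSummary(summary, results) {
  const percent = (summary.passRate * 100).toFixed(1);
  const lines = [
    "## Eval results",
    "",
    `${summary.passed}/${summary.total} passed (${percent}%)`,
    "",
    "| ID | Result | Rationale |",
    "| --- | --- | --- |",
  ];
  for (const r of results) {
    lines.push(`| ${r.id} | ${r.passed ? "pass" : "fail"} | ${r.rationale ?? ""} |`);
  }
  return lines.join("\n");
}
```

Anything other than a strict `true` counts as a failure here, so a missing or malformed verdict cannot inflate the pass rate.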

Not included

Phase 8 (persisting results to a git branch, à la experiments) is deferred.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

pelikhan · 2026-06-28T16:56:38Z

@copilot run inference in AWF and use /reflect to find a suitable inference endpoint.

github-actions · 2026-06-28T17:28:05Z

Hey @Copilot 👋 — great work on the BinEval evaluation support! The end-to-end shape is clean: the evals frontmatter schema, the Go compiler module, the JS harness, and the wiring into the orchestrator all land as a cohesive unit.

A few things that stand out positively:

Well-scoped: every changed file is in service of the same feature — nothing unrelated was mixed in.
Tests at every layer: compiler_evals_test.go (Go unit tests), eval_harness.test.cjs (JS unit tests), and evals_experimental_warning_test.go (integration guard) give solid coverage across the stack.
Clear description: the PR body explains the schema, the evaluation model, the harness design, and explicitly calls out what was deferred (Phase 8 persistence) — exactly the context reviewers need.
Experimental gating: emitting a compile-time warning via emitExperimentalFeatureWarnings is the right pattern for a feature that isn't production-ready yet.

This looks ready for review. 🚀

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

patchdiff.githubusercontent.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "patchdiff.githubusercontent.com"

See Network Configuration for more information.

Generated by ✅ Contribution Check · 293.1 AIC · ⌖ 20.4 AIC · ⊞ 6K · ◷

- Remove direct GitHub Models API calls from eval_harness.cjs; keep only shared utility functions (readEvalSpec, buildEvalPrompt, aggregateResults, renderMarkdownSummary, sanitizeEvalError)
- Add actions/setup/md/eval.md: eval prompt template instructing the engine to output EVAL_RESULT:{...json...}
- Add actions/setup/js/setup_eval.cjs: prompt setup script (mirrors setup_threat_detection.cjs)
- Add actions/setup/js/parse_eval_results.cjs: result parser extracting EVAL_RESULT from engine log
- Update compiler_evals.go: eval job now follows the detection job pattern: pulls AWF containers, clears MCP config, installs the agentic engine, runs it inside AWF, parses results
- Add EvalLogPath, EvalDir, DefaultEvalMaxAICredits constants to pkg/constants

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
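To make the `EVAL_RESULT:{...json...}` convention concrete, here is a minimal sketch of pulling such markers out of a plain-text engine log. It is not the code in parse_eval_results.cjs: the function name `extractEvalResults`, the one-marker-per-line assumption, and the decision to skip malformed payloads silently are all assumptions made for illustration. The stream-json log format is not covered.

```javascript
// Marker taken from the commit message; everything else is hypothetical.
const EVAL_RESULT_MARKER = "EVAL_RESULT:";

// Scans a plain-text log line by line and returns every JSON object that
// follows the marker. Lines without the marker, or with a payload that is
// not valid JSON, are skipped.
function extractEvalResults(logText) {
  const results = [];
  for (const line of logText.split(/\r?\n/)) {
    const markerAt = line.indexOf(EVAL_RESULT_MARKER);
    if (markerAt === -1) continue;

    const payload = line.slice(markerAt + EVAL_RESULT_MARKER.length).trim();
    try {
      const parsed = JSON.parse(payload);
      if (parsed !== null && typeof parsed === "object") {
        results.push(parsed);
      }
    } catch {
      // Malformed payload: ignore it rather than failing the whole parse.
    }
  }
  return results;
}
```

Searching for the marker anywhere in the line, rather than only at the start, tolerates log prefixes such as timestamps; whether the real parser does the same is not stated in the PR.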

pelikhan · 2026-06-28T17:51:20Z

@copilot keep working

Add missing test files for the two BinEval JS modules introduced in the AWF-engine refactor:

- setup_eval.test.cjs: 14 tests covering prompt template rendering, missing/empty context files, eval spec parsing, and step summary output
- parse_eval_results.test.cjs: 25 tests covering EVAL_RESULT extraction from plain and stream-json logs, main() error/success paths, and result normalisation

Mirrors the test coverage pattern of setup_threat_detection.test.cjs and parse_threat_detection_results.test.cjs.

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

github-actions · 2026-06-28T18:45:05Z

🤖 PR Triage — §28332039983

Field	Value
Category	`feature`
Risk	`high`
Priority	`medium`
Score	38 / 100 (impact 25 · urgency 5 · quality 8)
Action	`defer`
Batch	—

Rationale: Large experimental draft (+2181 lines). Adds a BinEval-style LLM evaluation harness in workflow frontmatter; significant new capability but high scope/risk. No CI results yet (merge state: UNSTABLE). Defer until draft is promoted to ready and CI passes.

Labels applied: pr-type:feature · pr-risk:high · pr-priority:medium · pr-action:defer · pr-agent:copilot-swe-agent

Generated by 🔧 PR Triage Agent · 65.9 AIC · ⌖ 11.5 AIC · ⊞ 5.4K · ◷

github-actions · 2026-06-29T01:34:22Z

🤖 PR Triage — §28342769269

Field	Value
Category	`feature`
Risk	`high`
Score	38/100
Priority	low
Action	`defer`
Status	Draft — no CI, no reviews

Score breakdown: Impact 25 · Urgency 6 · Quality 7

Rationale: Experimental BinEval LLM evaluation harness (+2181 lines). Large addition, still draft with no CI checks passing. ~10h old with no new activity. Defer until promoted to ready and CI established.

ℹ️ Carried over. Defer until draft promoted to ready.

Generated by 🔧 PR Triage Agent · 107.8 AIC · ⌖ 10.9 AIC · ⊞ 5.4K · ◷

github-actions · 2026-06-29T08:19:45Z

🤖 PR Triage — §28357644191

Field	Value
Category	`feature`
Risk	🔴 High
Score	30/100 (Impact 20 · Urgency 5 · Quality 5)
Action	`defer`

Carried over — 17.3h old. Experimental BinEval LLM evaluation harness (+2181 lines). Large addition, no CI yet, draft. Defer until promoted to ready and CI validates.

Generated by 🔧 PR Triage Agent · 89.9 AIC · ⌖ 12 AIC · ⊞ 5.4K · ◷

github-actions · 2026-06-29T14:00:26Z

🤖 PR Triage — §28376613466

Field	Value
Category	`feature` (experimental)
Risk	🔴 High
Priority	🟢 Low
Score	30 / 100
Action	⏸️ `defer`
Age	23h

Score breakdown: Impact 20 + Urgency 5 + Quality 5

Rationale: Adds native BinEval-style evaluations — large experimental addition (2181+/0−, 19 files), draft, no CI. High risk, no reviewer engagement yet. Defer until promoted from draft, CI passes, and feature scope is scoped down or approved.

ℹ️ pr-priority:medium label is stale: score is 30 (low boundary). Carried over from §28357644191.

Generated by 🔧 PR Triage Agent · 93.2 AIC · ⌖ 14.6 AIC · ⊞ 5.4K · ◷

github-actions · 2026-06-29T19:05:56Z

🔍 PR Triage — §28395315609

Field	Value
Category	`feature`
Risk	`high`
Score	26 / 100
Action	`defer`
Batch	—

Score breakdown: impact 15 + urgency 3 + quality 8

Carried over (28h). Experimental BinEval evaluation support. Large draft (19 files, +2181/-0), no CI, Phase 8 deferred by author. Priority corrected: pr-priority:medium → pr-priority:low (score 26). Defer until out of draft with CI.

Generated by 🔧 PR Triage Agent · 99.1 AIC · ⌖ 11.6 AIC · ⊞ 5.4K · ◷

Add experimental BinEval evaluation support

867e943

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI assigned Copilot and pelikhan Jun 28, 2026

Copilot created this pull request from a session on behalf of pelikhan June 28, 2026 14:47

pelikhan reviewed Jun 28, 2026

Comment thread actions/setup/js/eval_harness.cjs Outdated

Copilot started work on behalf of pelikhan June 28, 2026 14:50

Copilot finished work on behalf of pelikhan June 28, 2026 15:19

Copilot AI requested a review from pelikhan June 28, 2026 15:19

Copilot started work on behalf of pelikhan June 28, 2026 16:57

Copilot stopped work on behalf of pelikhan due to an error June 28, 2026 17:49
Copilot integration ID is required

Merge branch 'main' into copilot/add-eval-support-gh-aw

6eada22

Copilot started work on behalf of pelikhan June 28, 2026 17:58

Copilot finished work on behalf of pelikhan June 28, 2026 18:30

github-actions Bot added pr-action:defer pr-agent:copilot-swe-agent pr-priority:medium pr-risk:high pr-type:feature labels Jun 28, 2026

github-actions Bot mentioned this pull request Jun 28, 2026

[PR Triage Report] PR Triage Report — Agent PRs — 2026-06-28 Run §28332039983 #42121

Closed

github-actions Bot mentioned this pull request Jun 29, 2026

[PR Triage Report] Agent PRs — 2026-06-29 Run §28342769269 #42157

Closed

github-actions Bot mentioned this pull request Jun 29, 2026

[PR Triage Report] 🤖 PR Triage Report — 2026-06-29 · Run §28357644191 #42198

Closed

github-actions Bot mentioned this pull request Jun 29, 2026

[aw] No-Op Runs #39849

Open

github-actions Bot mentioned this pull request Jun 29, 2026

[PR Triage Report] Agent PR Triage Report — 2026-06-29 Run §28376613466 #42251

Closed

github-actions Bot added the pr-priority:low label Jun 29, 2026

github-actions Bot mentioned this pull request Jun 29, 2026

[PR Triage Report] 🤖 PR Triage Report — 2026-06-29 (Run §28395315609) #42327

Open

6 tasks

Copilot wants to merge 4 commits into main from copilot/add-eval-support-gh-aw
