feat: add experimental BinEval evaluation support #42100

Draft
Copilot wants to merge 4 commits into main from copilot/add-eval-support-gh-aw

Copilot AI commented Jun 28, 2026

Contributor

Adds native BinEval-style evaluations to gh-aw — small, binary questions declared in workflow frontmatter, executed post-run via an LLM harness, with results aggregated and reported as CI artifacts.

Schema (evals frontmatter)

  • New optional evals array with id + question fields; validated for unique IDs and non-empty questions
  • Emits experimental warning at compile time
evals:
  - id: builds
    question: Does the generated code compile?
  - id: focused
    question: Is the implementation limited to the requested change?
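
The validation rules above (unique IDs, non-empty questions) can be sketched as follows. This is an illustrative sketch only: the PR performs this check in the Go compiler, `validateEvals` is a hypothetical name, and rejecting an empty `id` is an assumption that the description does not state explicitly.

```javascript
// Hypothetical helper: returns a list of human-readable validation errors.
// An empty list means the `evals` frontmatter passed the checks.
function validateEvals(evals) {
  const errors = [];
  if (evals === undefined) return errors; // `evals` is optional
  if (!Array.isArray(evals)) return ["evals must be an array"];

  const seen = new Set();
  evals.forEach((entry, index) => {
    const id = typeof entry?.id === "string" ? entry.id.trim() : "";
    const question =
      typeof entry?.question === "string" ? entry.question.trim() : "";

    if (id === "") {
      // Assumption: an empty id is rejected along with duplicates.
      errors.push(`evals[${index}]: id must be a non-empty string`);
    } else if (seen.has(id)) {
      errors.push(`evals[${index}]: duplicate id "${id}"`);
    } else {
      seen.add(id);
    }

    if (question === "") {
      errors.push(`evals[${index}]: question must be a non-empty string`);
    }
  });
  return errors;
}
```

Applied to the two entries in the example above (`builds` and `focused`), this sketch returns an empty error list.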

Evaluation model

  • EvalDefinition, EvalResult, EvalSummary types in frontmatter_types.go
  • WorkflowData.Evals []EvalDefinition for downstream consumers

Eval job

  • New eval job injected after agent + detection jobs in the compiled workflow
  • JS harness (eval_harness.cjs) calls GitHub Models API (gpt-4o-mini) per question independently — no MCPs, no checkout
  • Prompt generation produces per-question binary prompts with rationale; no holistic scoring
  • Results aggregated (total/passed/failed/pass-rate) and uploaded as an eval artifact with a markdown step summary
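
A minimal sketch of the per-question prompt and the aggregation step described above. The function names `buildEvalPrompt` and `aggregateResults` are taken from the later commit message in this thread; their signatures, the prompt wording, and the `{ id, passed, rationale }` result shape are assumptions made for illustration, not the PR's actual code.

```javascript
// Assumed signature: one prompt per question, asking for a binary verdict
// plus a short rationale. No holistic score is requested.
function buildEvalPrompt(evalDef, context) {
  return [
    "Answer the following question about the agent run with a strict yes or no.",
    `Question (${evalDef.id}): ${evalDef.question}`,
    "Context:",
    context,
    'Reply with JSON only: {"passed": true|false, "rationale": "<one sentence>"}',
  ].join("\n");
}

// Assumed result shape: { id: string, passed: boolean, rationale?: string }.
// Anything other than `passed === true` counts as a failure.
function aggregateResults(results) {
  const total = results.length;
  const passed = results.filter((r) => r.passed === true).length;
  const failed = total - passed;
  const passRate = total === 0 ? 0 : passed / total;
  return { total, passed, failed, passRate };
}
```

With one passing and one failing result, `aggregateResults` in this sketch returns `{ total: 2, passed: 1, failed: 1, passRate: 0.5 }`. Treating a missing or non-boolean `passed` as a failure keeps the summary conservative.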

Not included

Phase 8 (persisting results to a git branch, à la experiments) is deferred.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Comment thread actions/setup/js/eval_harness.cjs Outdated
@pelikhan

Collaborator

@copilot run inference in AWF and use /reflect to find a suitable inference endpoint.

@github-actions

Contributor

Hey @Copilot 👋 — great work on the BinEval evaluation support! The end-to-end shape is clean: the evals frontmatter schema, the Go compiler module, the JS harness, and the wiring into the orchestrator all land as a cohesive unit.

A few things that stand out positively:

  • Well-scoped: every changed file is in service of the same feature — nothing unrelated was mixed in.
  • Tests at every layer: compiler_evals_test.go (Go unit tests), eval_harness.test.cjs (JS unit tests), and evals_experimental_warning_test.go (integration guard) give solid coverage across the stack.
  • Clear description: the PR body explains the schema, the evaluation model, the harness design, and explicitly calls out what was deferred (Phase 8 persistence) — exactly the context reviewers need.
  • Experimental gating: emitting a compile-time warning via emitExperimentalFeatureWarnings is the right pattern for a feature that isn't production-ready yet.

This looks ready for review. 🚀

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • patchdiff.githubusercontent.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "patchdiff.githubusercontent.com"

See Network Configuration for more information.

Generated by ✅ Contribution Check · 293.1 AIC · ⌖ 20.4 AIC · ⊞ 6K ·

- Remove direct GitHub Models API calls from eval_harness.cjs; keep only shared utility functions (readEvalSpec, buildEvalPrompt, aggregateResults, renderMarkdownSummary, sanitizeEvalError)
- Add actions/setup/md/eval.md: eval prompt template instructing the engine to output EVAL_RESULT:{...json...}
- Add actions/setup/js/setup_eval.cjs: prompt setup script (mirrors setup_threat_detection.cjs)
- Add actions/setup/js/parse_eval_results.cjs: result parser extracting EVAL_RESULT from engine log
- Update compiler_evals.go: eval job now follows the detection job pattern — pulls AWF containers, clears MCP config, installs the agentic engine, runs it inside AWF, parses results
- Add EvalLogPath, EvalDir, DefaultEvalMaxAICredits constants to pkg/constants
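
The result-parsing step can be sketched roughly as below. This assumes a plain-text engine log in which each result sits on its own line as `EVAL_RESULT:` followed by a single JSON object with nothing after it; `extractEvalResults` is a hypothetical name, and the real `parse_eval_results.cjs` may differ (the stream-json case is not covered here).

```javascript
const EVAL_RESULT_PREFIX = "EVAL_RESULT:";

// Hypothetical helper: scans a plain-text log and returns every JSON object
// that follows an EVAL_RESULT: marker. Lines with malformed JSON are skipped.
function extractEvalResults(logText) {
  const results = [];
  for (const line of logText.split(/\r?\n/)) {
    const start = line.indexOf(EVAL_RESULT_PREFIX);
    if (start === -1) continue;

    const payload = line.slice(start + EVAL_RESULT_PREFIX.length).trim();
    try {
      const parsed = JSON.parse(payload);
      // Keep only JSON objects; arrays and primitives are ignored.
      if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) {
        results.push(parsed);
      }
    } catch {
      // Malformed JSON: skip this line instead of failing the whole job.
    }
  }
  return results;
}
```

Skipping malformed lines is a design choice in this sketch; a stricter parser could instead record them as failed evaluations so that they show up in the summary.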

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot stopped work on behalf of pelikhan due to an error June 28, 2026 17:49
@pelikhan

Collaborator

@copilot keep working

Add missing test files for the two BinEval JS modules introduced in the
AWF-engine refactor:

- setup_eval.test.cjs: 14 tests covering prompt template rendering,
  missing/empty context files, eval spec parsing, and step summary output
- parse_eval_results.test.cjs: 25 tests covering EVAL_RESULT extraction
  from plain and stream-json logs, main() error/success paths, and result
  normalisation

Mirrors the test coverage pattern of setup_threat_detection.test.cjs and
parse_threat_detection_results.test.cjs.

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
@github-actions

Contributor

🤖 PR Triage — §28332039983

| Field | Value |
| --- | --- |
| Category | feature |
| Risk | high |
| Priority | medium |
| Score | 38 / 100 (impact 25 · urgency 5 · quality 8) |
| Action | defer |
| Batch | |

Rationale: Large experimental draft (+2181 lines). Adds a BinEval-style LLM evaluation harness in workflow frontmatter — significant new capability but high scope/risk. No CI results yet (merge state: UNSTABLE). Defer until draft is promoted to ready and CI passes.

Labels applied: pr-type:feature · pr-risk:high · pr-priority:medium · pr-action:defer · pr-agent:copilot-swe-agent

Generated by 🔧 PR Triage Agent · 65.9 AIC · ⌖ 11.5 AIC · ⊞ 5.4K ·

@github-actions

Contributor

🤖 PR Triage — §28342769269

| Field | Value |
| --- | --- |
| Category | feature |
| Risk | high |
| Score | 38/100 |
| Priority | low |
| Action | defer |
| Status | Draft — no CI, no reviews |

Score breakdown: Impact 25 · Urgency 6 · Quality 7

Rationale: Experimental BinEval LLM evaluation harness (+2181 lines). Large addition, still draft with no CI checks passing. ~10h old with no new activity. Defer until promoted to ready and CI established.

ℹ️ Carried over. Defer until draft promoted to ready.

Generated by 🔧 PR Triage Agent · 107.8 AIC · ⌖ 10.9 AIC · ⊞ 5.4K ·

@github-actions

Contributor

🤖 PR Triage — §28357644191

| Field | Value |
| --- | --- |
| Category | feature |
| Risk | 🔴 High |
| Score | 30/100 (Impact 20 · Urgency 5 · Quality 5) |
| Action | defer |

Carried over — 17.3h old. Experimental BinEval LLM evaluation harness (+2181 lines). Large addition, no CI yet, draft. Defer until promoted to ready and CI validates.

Generated by 🔧 PR Triage Agent · 89.9 AIC · ⌖ 12 AIC · ⊞ 5.4K ·

github-actions bot mentioned this pull request Jun 29, 2026
@github-actions

Contributor

🤖 PR Triage — §28376613466

| Field | Value |
| --- | --- |
| Category | feature (experimental) |
| Risk | 🔴 High |
| Priority | 🟢 Low |
| Score | 30 / 100 |
| Action | ⏸️ defer |
| Age | 23h |

Score breakdown: Impact 20 + Urgency 5 + Quality 5

Rationale: Adds native BinEval-style evaluations — large experimental addition (2181+/0−, 19 files), draft, no CI. High risk, no reviewer engagement yet. Defer until promoted from draft, CI passes, and the feature scope is reduced or approved.

ℹ️ pr-priority:medium label is stale — score is 30 (low boundary). Carried over from §28357644191.

Generated by 🔧 PR Triage Agent · 93.2 AIC · ⌖ 14.6 AIC · ⊞ 5.4K ·

@github-actions

Contributor

🔍 PR Triage — §28395315609

| Field | Value |
| --- | --- |
| Category | feature |
| Risk | high |
| Score | 26 / 100 |
| Action | defer |
| Batch | |

Score breakdown: impact 15 + urgency 3 + quality 8

Carried over (28h). Experimental BinEval evaluation support. Large draft (19 files, +2181/-0), no CI, Phase 8 deferred by author. Priority corrected: pr-priority:medium → pr-priority:low (score 26). Defer until out of draft with CI.

Generated by 🔧 PR Triage Agent · 99.1 AIC · ⌖ 11.6 AIC · ⊞ 5.4K ·
