feat(workflows): add beval behavioral evaluation workflow for dt-coach agent #1129
eedorenko wants to merge 78 commits into
Conversation
Add 30 test cases across 4 categories (coaching behaviors, session phases, method guidance, progressive hints) with ACP judge integration. Include reusable CI workflow and PR validation hook with fork guard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed port specification from agent startup command.
Add prompt to copilot agent startup command.
Added working-directory to Start agent step in beval.yml
Switch to init_prompt to reliably activate the dt-coach agent in ACP sessions. Remove --agent flag from copilot TCP start, add port-readiness polling. Add agent identity verification case. Copy dt-coach.agent.md to .github/agents/ for flat discovery. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pin actions/checkout, actions/setup-python, and actions/upload-artifact to SHA hashes to satisfy hve-core dependency pinning policy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes "Directory path must be absolute: ." error from copilot agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add model to agent.yaml and eval.config.yaml connection config so it is applied via set_session_model. Remove --model from workflow CLI args. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Remove branch pin from beval pip install so it uses the default branch of the vyta/beval repo instead of eedorenko/skill-agent. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Add beval, wireframes, parseable to cspell dictionary
- Ignore beval/results/** from spell check (generated output)
- Add top-level and job-level permissions blocks to test-token.yml
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Add behavioral evaluation job to release-stable.yml
- Remove test-token.yml debug workflow
- Remove dt-coach.agent.md (not part of this contribution)
- Remove beval/results/ (generated output, not for source control)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Run npm audit fix to update flatted to a non-vulnerable version. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
WilliamBerryiii
left a comment
Thank you for this PR, @eedorenko. Behavioral evaluation for the dt-coach agent is a valuable addition, and we appreciate the effort to formalize agent quality testing with structured evaluation cases.
After reviewing the workflow changes against our CI security standards, we've identified several issues that need to be resolved before this can merge. The findings fall into two categories: supply-chain security violations in the beval workflow, and architectural concerns with integrating it into PR validation and release pipelines.
Important
The combination of unpinned dependencies from an external personal repository, unpinned npm range installs, inherited secrets, and persisted credentials creates a compound risk. A compromise of any one dependency effectively grants access to all repository secrets and the CI execution context.
We've added inline comments on each affected file with specific context and suggested changes. The critical items are:
- `pip install` from `vyta/beval` with no commit SHA and no hash verification (see comment on `beval.yml` line 32)
- `npm install -g @github/copilot@1` with a major-version range and no lockfile (see comment on `beval.yml` line 29)
- `actions/checkout` without `persist-credentials: false` (see comment on `beval.yml` line 21)
- Both copilot instances launch with `--allow-all`, granting unrestricted permissions (see comment on `beval.yml` line 36)
- `secrets: inherit` in both calling workflows forwards all repository secrets when only `COPILOT_TOKEN` is needed
- Behavioral evaluation should not gate PR merges or releases at this stage (see comments on `pr-validation.yml` and `release-stable.yml`)
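As a rough illustration of the pinning and credential items above, a reusable workflow might look something like the sketch below. This is not the PR's actual `beval.yml`: the job name, the `runs-on` value, and every value in angle brackets are placeholders that would need to be replaced with real, verified values, and the GitHub URL for `vyta/beval` is an assumption based on the repository name mentioned in the review.

```yaml
# Sketch only. Values in <angle brackets> are placeholders, not real SHAs or versions.
name: beval
on:
  workflow_call:
    secrets:
      COPILOT_TOKEN:
        required: true   # declared explicitly so callers need not use `secrets: inherit`

permissions:
  contents: read

jobs:
  evaluate:                 # hypothetical job name
    runs-on: ubuntu-latest  # assumed runner
    steps:
      - name: Checkout repository
        uses: actions/checkout@<full-commit-sha> # <version label>
        with:
          persist-credentials: false   # keep the token out of the local git config

      - name: Install Copilot CLI at an exact version
        run: npm install -g @github/copilot@<exact-version>

      - name: Install beval at a pinned commit
        run: pip install "git+https://github.com/vyta/beval@<full-commit-sha>"
```

With a declaration like this, a calling workflow could pass only the one secret (`secrets:` with a single `COPILOT_TOKEN: ${{ secrets.COPILOT_TOKEN }}` entry) rather than forwarding everything.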
Our repository enforces these standards through Test-DependencyPinning.ps1, Test-WorkflowPermissions.ps1, and the conventions documented in workflows.instructions.md. The copilot-setup-steps.yml workflow demonstrates the expected pattern for downloading and verifying external binaries.
We recommend deploying beval as a standalone workflow_dispatch or scheduled workflow instead of integrating it into pr-validation.yml and release-stable.yml. This allows behavioral testing to proceed without gating contributor workflows or release processes.
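A standalone trigger block along these lines would keep the evaluation out of the PR and release paths. The cron expression is purely illustrative and not something the reviewers specified.

```yaml
# Sketch only: run on demand or on a schedule, never as a required PR or release check.
on:
  workflow_dispatch:
  schedule:
    - cron: "0 6 * * 1"   # hypothetical cadence: Mondays at 06:00 UTC
```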
Please comment if you have questions about any of the suggestions, and we can discuss further.
The missing comma after copilot-win32-x64 caused it to be concatenated with pkg:npm/hve-core into a single invalid entry, so the dependency review check rejected the copilot-win32-x64 license.
chaosdinosaur
left a comment
Solid addition of behavioral evaluation for the DT Coach agent — the test cases are well-crafted and closely aligned with the agent's Think/Speak/Empower philosophy. SHA pinning on Actions and the beval install are good.
A few items to address: cspell ignore path mismatch with actual results path, missing concurrency block per repo conventions, and the personal-repo supply chain consideration for the beval dependency. Minor: cspell word ordering and lockfile noise from merge churn.
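For readers unfamiliar with the concurrency item, a typical GitHub Actions concurrency block is sketched below. The repository's actual convention is not shown in this conversation, so the group key and the `cancel-in-progress` setting here are assumptions.

```yaml
# Sketch only: the group expression is an assumed convention, not the repo's documented one.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true   # a newer run on the same ref cancels the older one
```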
- Add concurrency block to beval.yml per repo conventions
- Add supply-chain context comment on beval personal-repo install
- Fix cspell ignorePaths to match actual results output path
- Sort cspell words list alphabetically
- Reset package.json and package-lock.json to main to remove merge churn
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflicts in .cspell.json (keep both beval and behavioural, deduplicate smol, maintain alphabetical order), take upstream versions of package.json and package-lock.json. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@chaosdinosaur thank you for the review! I addressed your comments, please review.
chaosdinosaur
left a comment
Reviewed the current diff against repo conventions. Most prior review feedback has been addressed. Three new items:
- Rebase needed: action SHAs/versions have drifted from main (checkout label, upload-artifact version)
- Missing `.gitignore`: beval results directory should be gitignored like `evals/results/`
- Directory consolidation: consider moving `beval/` under `evals/` to co-locate with the existing Vally evaluation framework
AGENT_REPO_ROOT: ${{ github.workspace }}

steps:
- name: Checkout repository
This branch has drifted from main. Two action references are now inconsistent with the repository standard:
- `actions/checkout`: the SHA `de0fac2e...` is correct but the version comment says `# v4.2.2`. All other workflows now label this same SHA as `# v6.0.2`.
- `actions/upload-artifact`: uses `@bbbca2ddaa5d...` (v4.4.3) but the repo standard is now `@043fb46d1a93c...` (v7.0.1).
Both will likely fail the action-version-consistency-scan workflow. A rebase onto current main should resolve these automatically.
"CHANGELOG.md",
"logs/**",
"docs/docusaurus/build/**"
"docs/docusaurus/build/**",
🔍 Missing .gitignore entry for beval results
The beval/dt-coach/results/ directory is not gitignored. While CI generates results as uploaded artifacts, local runs of beval would produce results.json that could be accidentally committed. The existing pattern for the Vally framework ignores evals/results/ — same convention should apply here.
Suggested addition to .gitignore:
# Beval evaluation results
beval/**/results/
@@ -0,0 +1,20 @@
eval:
💡 Consider consolidating with existing evals/ directory
The repo already has a structured evals/ directory with agent-behavior evaluations using @microsoft/vally-cli (see evals/agent-behavior/eval.yaml). This PR introduces a parallel top-level beval/ directory for a similar purpose — evaluating agent behavior.
Could the beval cases and configuration be placed under evals/ (e.g., evals/beval/dt-coach/) to keep all evaluation artifacts co-located? This would:
- Make it easier for contributors to discover all evaluation suites in one place
- Share `.gitignore` patterns (`evals/results/` is already ignored)
- Align with the existing project structure documented in `evals/README.md`
The Vally and beval toolchains can coexist under the same parent directory even if their config formats differ.
Description
Adds a behavioral evaluation (beval) CI workflow for the `dt-coach` agent using GitHub Copilot CLI over ACP (TCP). The workflow:
`beval/cases/`
Also pins all GitHub Actions dependencies to SHA hashes for supply chain security, and installs beval from the default branch of the `vyta/beval` repo.
Sample eval run: https://github.com/eedorenko/hve-core/actions/runs/23311489579/job/67799722616
Related Issue(s)
Type of Change
Code & Documentation:
Infrastructure & Configuration:
AI Artifacts:
`prompt-builder` agent and addressed all feedback
(`.github/instructions/*.instructions.md`)
(`.github/prompts/*.prompt.md`)
(`.github/agents/*.agent.md`)
(`.github/skills/*/SKILL.md`)
Other:
(`.ps1`, `.sh`, `.py`)
Testing
The workflow has been validated by triggering it manually via `workflow_dispatch`. All 30 evaluation cases passed with an overall score of 0.81.
Checklist
Required Checks
AI Artifact Contributions
`/prompt-analyze` to review contribution
`prompt-builder` review
Required Automated Checks
- `npm run lint:md`
- `npm run spell-check`
- `npm run lint:frontmatter`
- `npm run validate:skills`
- `npm run lint:md-links`
- `npm run lint:ps`
- `npm run plugin:generate`
Security Considerations
Additional Notes
All GitHub Actions `uses:` steps are pinned to SHA hashes per supply chain security best practices.