feat(workflows): add beval behavioral evaluation workflow for dt-coach agent by eedorenko · Pull Request #1129 · microsoft/hve-core

eedorenko · 2026-03-19T19:36:09Z

Description

Adds a behavioral evaluation (beval) CI workflow for the dt-coach agent using GitHub Copilot CLI over ACP (TCP). The workflow:

- Starts two Copilot CLI instances (agent on port 3000, judge on port 3001)
- Runs beval evaluations against cases defined in beval/cases/
- Uploads results as a workflow artifact
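
Because the evaluation can only start once both TCP listeners accept connections, a readiness check is useful before invoking beval. The following is a minimal sketch of such a check, not the workflow's actual implementation; the function name `wait_for_port` and its parameters are hypothetical, and only the port numbers come from the description above.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A caller could run `wait_for_port("127.0.0.1", 3000)` for the agent and `wait_for_port("127.0.0.1", 3001)` for the judge, and fail the job early if either returns False.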

Also pins all GitHub Actions dependencies to SHA hashes for supply chain security, and installs beval from the default branch of the vyta/beval repo.

Sample eval run: https://github.com/eedorenko/hve-core/actions/runs/23311489579/job/67799722616

======================================================================
  SCORECARD
======================================================================
  Overall: 0.81  (30/30 cases passed)

  Metric                Score  Bar
  -------------------- ------  ------------
  latency               0.86  [#########-]
  quality               0.78  [########--]

  Case                                      Score  Status
  ---------------------------------------- ------  ------
  Response follows Think/Speak/Empower ...  0.66  + PASS
  Keep responses concise — no methodolo...  0.86  + PASS
  End with choices not directives           0.91  + PASS
  Work WITH users, not FOR them             0.95  + PASS
  Do not prescribe specific solutions t...  0.93  + PASS
  Stay curious and supportive when user...  0.81  + PASS
  Method 1: Assess whether request is f...  0.93  + PASS
  Method 1: Guide stakeholder identific...  0.82  + PASS
  Method 2: Help plan systematic research   0.68  + PASS
  Method 3: Guide pattern recognition f...  0.81  + PASS
  Method 4: Facilitate divergent ideation   0.65  + PASS
  Method 5: Guide concept creation for ...  0.65  + PASS
  Method 6: Encourage scrappy constrain...  0.86  + PASS
  Method 7: Guide technical feasibility...  0.65  + PASS
  Method 8: Structure user testing for ...  0.65  + PASS
  Method 9: Guide continuous optimizati...  0.63  + PASS
  Start with broad hints when user is s...  0.73  + PASS
  Escalate hints when user remains stuck    0.90  + PASS
  Accept backward transitions between m...  0.83  + PASS
  Announce method shifts transparently      0.84  + PASS
  Avoid multiple-choice question lists      0.81  + PASS
  Do not change method focus without an...  0.95  + PASS
  Resume session with state context         0.63  + PASS
  Ask for project slug during initializ...  0.94  + PASS
  Gather role, team, and method focus d...  0.92  + PASS
  Default to Method 1 for new projects      0.88  + PASS
  Ask targeted, open-ended questions du...  0.88  + PASS
  Summarize progress and check direction    0.78  + PASS
  Recap accomplishments and confirm met...  0.83  + PASS
  Summarize session and suggest next st...  0.89  + PASS

  Avg time: 38.5s
======================================================================
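
The unweighted mean of the 30 case scores listed above is about 0.81, which agrees with the reported overall score, although the scorecard does not state how beval aggregates. To surface the weakest cases from a scorecard like this one, a small parser can help. This is a sketch that assumes the text layout shown above; the `- FAIL` row form is a guess, since the sample run contains only passing rows.

```python
import re
from statistics import mean

# Matches case rows such as "  End with choices not directives  0.91  + PASS".
# The "- FAIL" form is an assumption; the sample run shows only "+ PASS".
CASE_ROW = re.compile(
    r"^\s*(?P<name>.+?)\s{2,}(?P<score>\d\.\d{2})\s+[+-]\s+(?P<status>PASS|FAIL)\s*$"
)


def parse_cases(scorecard: str) -> list[tuple[str, float, str]]:
    """Extract (case name, score, status) from the case table of a scorecard."""
    rows = []
    for line in scorecard.splitlines():
        m = CASE_ROW.match(line)
        if m:
            rows.append((m["name"], float(m["score"]), m["status"]))
    return rows


# A few rows copied from the scorecard above; metric and header rows are skipped.
SAMPLE = """\
  Metric                Score  Bar
  latency               0.86  [#########-]
  Case                                      Score  Status
  Response follows Think/Speak/Empower ...  0.66  + PASS
  End with choices not directives           0.91  + PASS
  Resume session with state context         0.63  + PASS
"""

cases = parse_cases(SAMPLE)
lowest = min(cases, key=lambda c: c[1])
print(f"{len(cases)} cases, mean {mean(c[1] for c in cases):.2f}, lowest: {lowest[0]}")
```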

Related Issue(s)

Type of Change

Code & Documentation:

Bug fix (non-breaking change fixing an issue)
New feature (non-breaking change adding functionality)
Breaking change (fix or feature causing existing functionality to change)
Documentation update

Infrastructure & Configuration:

AI Artifacts:

Reviewed contribution with prompt-builder agent and addressed all feedback
Copilot instructions (.github/instructions/*.instructions.md)
Copilot prompt (.github/prompts/*.prompt.md)
Copilot agent (.github/agents/*.agent.md)
Copilot skill (.github/skills/*/SKILL.md)

Other:

Script/automation (.ps1, .sh, .py)
Other (please describe):

Testing

The workflow has been validated by triggering it manually via workflow_dispatch. All 30 evaluation cases passed with an overall score of 0.81.

Checklist

Required Checks

Documentation is updated (if applicable)
Files follow existing naming conventions
Changes are backwards compatible (if applicable)
Tests added for new functionality (if applicable)

AI Artifact Contributions

Used /prompt-analyze to review contribution
Addressed all feedback from prompt-builder review
Verified contribution follows common standards and type-specific requirements

Required Automated Checks

Markdown linting: npm run lint:md
Spell checking: npm run spell-check
Frontmatter validation: npm run lint:frontmatter
Skill structure validation: npm run validate:skills
Link validation: npm run lint:md-links
PowerShell analysis: npm run lint:ps
Plugin freshness: npm run plugin:generate

Security Considerations

This PR does not contain any sensitive or NDA information
Any new dependencies have been reviewed for security issues
Security-related scripts follow the principle of least privilege

Additional Notes

All GitHub Actions uses: steps are pinned to SHA hashes per supply chain security best practices.
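
For readers who want to spot-check pinning locally, the idea can be sketched as a scan for `uses:` references whose ref is not a full 40-character commit SHA. This is only an illustration and not the repository's Test-DependencyPinning.ps1; the function name and the placeholder SHA are hypothetical.

```python
import re

# A `uses:` reference counts as pinned when its ref is a full 40-character commit SHA.
USES = re.compile(r"^\s*(?:-\s*)?uses:\s*(?P<action>[^@\s]+)@(?P<ref>\S+)")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return `action@ref` strings whose ref is not a full commit SHA."""
    found = []
    for line in workflow_text.splitlines():
        m = USES.match(line)
        if m and not FULL_SHA.match(m["ref"]):
            found.append(f"{m['action']}@{m['ref']}")
    return found


# Placeholder SHA and version tags for illustration only; not real release commits.
SAMPLE = """\
steps:
  - uses: actions/checkout@0123456789abcdef0123456789abcdef01234567 # v0.0.0
  - uses: actions/setup-python@v5
"""
print(unpinned_actions(SAMPLE))
```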

Add 30 test cases across 4 categories (coaching behaviors, session phases, method guidance, progressive hints) with ACP judge integration. Include reusable CI workflow and PR validation hook with fork guard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Switch to init_prompt to reliably activate the dt-coach agent in ACP sessions. Remove --agent flag from copilot TCP start, add port-readiness polling. Add agent identity verification case. Copy dt-coach.agent.md to .github/agents/ for flat discovery. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add beval, wireframes, parseable to cspell dictionary
- Ignore beval/results/** from spell check (generated output)
- Add top-level and job-level permissions blocks to test-token.yml

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Add behavioral evaluation job to release-stable.yml
- Remove test-token.yml debug workflow
- Remove dt-coach.agent.md (not part of this contribution)
- Remove beval/results/ (generated output, not for source control)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

eedorenko · 2026-03-19T21:01:28Z

FYI @vyta @bjcmit

WilliamBerryiii

Thank you for this PR, @eedorenko. Behavioral evaluation for the dt-coach agent is a valuable addition, and we appreciate the effort to formalize agent quality testing with structured evaluation cases.

After reviewing the workflow changes against our CI security standards, we've identified several issues that need to be resolved before this can merge. The findings fall into two categories: supply-chain security violations in the beval workflow, and architectural concerns with integrating it into PR validation and release pipelines.

Important

The combination of unpinned dependencies from an external personal repository, unpinned npm range installs, inherited secrets, and persisted credentials creates a compound risk. A compromise of any one dependency effectively grants access to all repository secrets and the CI execution context.

We've added inline comments on each affected file with specific context and suggested changes. The critical items are:

- pip install from vyta/beval with no commit SHA and no hash verification (see comment on beval.yml line 32)
- npm install -g @github/copilot@1 with a major-version range and no lockfile (see comment on beval.yml line 29)
- actions/checkout without persist-credentials: false (see comment on beval.yml line 21)
- Both copilot instances launch with --allow-all, granting unrestricted permissions (see comment on beval.yml line 36)
- secrets: inherit in both calling workflows forwards all repository secrets when only COPILOT_TOKEN is needed
- Behavioral evaluation should not gate PR merges or releases at this stage (see comments on pr-validation.yml and release-stable.yml)
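
Several of the items above can be detected with plain text matching. The sketch below is a rough heuristic covering three of them, assuming the workflow is available as a string; it is not YAML-aware, it is not one of the repository's validation scripts, and the function name and sample workflow are hypothetical.

```python
import re


def review_findings(workflow_text: str) -> list[str]:
    """Flag three of the review's findings using simple text patterns."""
    findings = []
    if re.search(r"^\s*secrets:\s*inherit\s*$", workflow_text, re.MULTILINE):
        findings.append("secrets: inherit forwards every repository secret")
    if "--allow-all" in workflow_text:
        findings.append("--allow-all grants unrestricted permissions")
    if "actions/checkout@" in workflow_text and not re.search(
        r"persist-credentials:\s*false", workflow_text
    ):
        findings.append("actions/checkout without persist-credentials: false")
    return findings


# Illustrative fragment: the SHA is a placeholder and the run line is abbreviated.
SAMPLE = """\
jobs:
  beval:
    steps:
      - uses: actions/checkout@0123456789abcdef0123456789abcdef01234567
      - run: copilot --allow-all
"""
for finding in review_findings(SAMPLE):
    print(finding)
```

A check like this only narrows down where to look; the inline review comments remain the authoritative list of required changes.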

Our repository enforces these standards through Test-DependencyPinning.ps1, Test-WorkflowPermissions.ps1, and the conventions documented in workflows.instructions.md. The copilot-setup-steps.yml workflow demonstrates the expected pattern for downloading and verifying external binaries.

We recommend deploying beval as a standalone workflow_dispatch or scheduled workflow instead of integrating it into pr-validation.yml and release-stable.yml. This allows behavioral testing to proceed without gating contributor workflows or release processes.

Please comment if you have questions about any of the suggestions, and we can discuss further.

chaosdinosaur

Solid addition of behavioral evaluation for the DT Coach agent — the test cases are well-crafted and closely aligned with the agent's Think/Speak/Empower philosophy. SHA pinning on Actions and the beval install are good.

A few items to address: cspell ignore path mismatch with actual results path, missing concurrency block per repo conventions, and the personal-repo supply chain consideration for the beval dependency. Minor: cspell word ordering and lockfile noise from merge churn.

- Add concurrency block to beval.yml per repo conventions
- Add supply-chain context comment on beval personal-repo install
- Fix cspell ignorePaths to match actual results output path
- Sort cspell words list alphabetically
- Reset package.json and package-lock.json to main to remove merge churn

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Resolve conflicts in .cspell.json (keep both beval and behavioural, deduplicate smol, maintain alphabetical order), take upstream versions of package.json and package-lock.json. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

eedorenko · 2026-05-12T20:10:34Z

@chaosdinosaur thank you for the review! I addressed your comments, please review.

chaosdinosaur

Reviewed the current diff against repo conventions. Most prior review feedback has been addressed. Three new items:

Rebase needed — action SHAs/versions have drifted from main (checkout label, upload-artifact version)
Missing .gitignore — beval results directory should be gitignored like evals/results/
Directory consolidation — consider moving beval/ under evals/ to co-locate with the existing Vally evaluation framework

chaosdinosaur · 2026-05-21T18:02:53Z

+      AGENT_REPO_ROOT: ${{ github.workspace }}
+
+    steps:
+      - name: Checkout repository


⚠️ Rebase needed — action versions are stale

This branch has drifted from main. Two action references are now inconsistent with the repository standard:

actions/checkout — the SHA de0fac2e... is correct but the version comment says # v4.2.2. All other workflows now label this same SHA as # v6.0.2.

actions/upload-artifact — uses @bbbca2ddaa5d... (v4.4.3) but the repo standard is now @043fb46d1a93c... (v7.0.1).

Both will likely fail the action-version-consistency-scan workflow. A rebase onto current main should resolve these automatically.
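
The kind of drift described here, where one SHA carries different version comments in different workflows, can be illustrated with a short sketch. It is not the action-version-consistency-scan workflow itself; the function name, file names, and placeholder SHA are hypothetical, and only the two version labels come from the comment above.

```python
import re
from collections import defaultdict

USES = re.compile(
    r"uses:\s*(?P<action>[^@\s]+)@(?P<sha>[0-9a-f]{40})\s*#\s*(?P<label>\S+)"
)


def inconsistent_labels(workflows: dict[str, str]) -> dict[tuple[str, str], set[str]]:
    """Map (action, sha) to its version comments when more than one label is in use."""
    labels = defaultdict(set)
    for text in workflows.values():
        for m in USES.finditer(text):
            labels[(m["action"], m["sha"])].add(m["label"])
    return {key: seen for key, seen in labels.items() if len(seen) > 1}


# Placeholder SHA; the real one is truncated in the review comment.
SHA = "0123456789abcdef0123456789abcdef01234567"
WORKFLOWS = {
    "beval.yml": f"- uses: actions/checkout@{SHA} # v4.2.2\n",
    "other.yml": f"- uses: actions/checkout@{SHA} # v6.0.2\n",
}
print(inconsistent_labels(WORKFLOWS))
```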

chaosdinosaur · 2026-05-21T18:03:05Z

    "CHANGELOG.md",
    "logs/**",
-    "docs/docusaurus/build/**"
+    "docs/docusaurus/build/**",


🔍 Missing .gitignore entry for beval results

The beval/dt-coach/results/ directory is not gitignored. While CI generates results as uploaded artifacts, local runs of beval would produce results.json that could be accidentally committed. The existing pattern for the Vally framework ignores evals/results/ — same convention should apply here.

Suggested addition to .gitignore:

# Beval evaluation results
beval/**/results/
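
In gitignore syntax, `/**/` matches zero or more directories, so this pattern covers a `results` directory at any depth under `beval/`. The sketch below approximates that behavior for repo-relative file paths; it is not git's matcher, and the function name and example paths are hypothetical.

```python
from pathlib import PurePosixPath


def covered_by_beval_results(path: str) -> bool:
    """Approximate whether `beval/**/results/` would ignore a repo-relative file path.

    The pattern ignores any directory named `results` at any depth under
    `beval/`, so a file is covered when such a directory is one of its parents.
    """
    parts = PurePosixPath(path).parts
    return len(parts) >= 3 and parts[0] == "beval" and "results" in parts[1:-1]


for candidate in (
    "beval/dt-coach/results/results.json",
    "beval/results/results.json",
    "beval/dt-coach/cases/example.yaml",
    "evals/results/out.json",
):
    print(candidate, covered_by_beval_results(candidate))
```

For an authoritative answer on a real checkout, `git check-ignore -v <path>` reports which pattern, if any, applies.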

chaosdinosaur · 2026-05-21T18:03:17Z

@@ -0,0 +1,20 @@
+eval:


💡 Consider consolidating with existing evals/ directory

The repo already has a structured evals/ directory with agent-behavior evaluations using @microsoft/vally-cli (see evals/agent-behavior/eval.yaml). This PR introduces a parallel top-level beval/ directory for a similar purpose — evaluating agent behavior.

Could the beval cases and configuration be placed under evals/ (e.g., evals/beval/dt-coach/) to keep all evaluation artifacts co-located? This would:

Make it easier for contributors to discover all evaluation suites in one place

Share .gitignore patterns (evals/results/ is already ignored)

Align with the existing project structure documented in evals/README.md

The Vally and beval toolchains can coexist under the same parent directory even if their config formats differ.

eedorenko and others added 18 commits March 16, 2026 13:24

Update copilot command to use claude-opus model

ef56eae

Simplify agent startup command in beval.yml

8faa4ea

Removed port specification from agent startup command.

Modify copilot command to include prompt

50a03dd

Add prompt to copilot agent startup command.

Specify working directory for Start agent step

26fcbe7

Added working-directory to Start agent step in beval.yml

fix: pin GitHub Actions dependencies to SHA hashes

5a7ae11

Pin actions/checkout, actions/setup-python, and actions/upload-artifact to SHA hashes to satisfy hve-core dependency pinning policy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: trigger beval workflow test

ade4c27

ci: add token debug workflow

c708932

ci: add token verification step to beval workflow

01849f7

ci: use claude-opus-4.6-fast model for agent and judge

de9e55e

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: use claude-opus-4.6-1m model and add debug logging

859fa91

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: set AGENT_REPO_ROOT to absolute workspace path

7e5afbe

Fixes "Directory path must be absolute: ." error from copilot agent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: temporarily run only agent_identity case

967c680

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: set model via ACP session instead of CLI flag

3bf5071

Add model to agent.yaml and eval.config.yaml connection config so it is applied via set_session_model. Remove --model from workflow CLI args. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ci: remove token verification step and run full test suite

b00d3f0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: remove debug logging and agent_identity test case

4f1a9c2

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

ci: install beval from default branch

fcaf374

Remove branch pin from beval pip install so it uses the default branch of the vyta/beval repo instead of eedorenko/skill-agent. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

eedorenko requested a review from a team as a code owner March 19, 2026 19:36

Merge branch 'main' into eedorenko/beval

d662c71

github-advanced-security AI found potential problems Mar 19, 2026

Comment thread .github/workflows/test-token.yml Fixed

github-advanced-security AI found potential problems Mar 19, 2026

Comment thread .github/workflows/test-token.yml Fixed

github-advanced-security AI found potential problems Mar 19, 2026

Comment thread .github/workflows/test-token.yml Fixed

eedorenko marked this pull request as draft March 19, 2026 20:43

eedorenko and others added 2 commits March 19, 2026 13:45

fix: resolve flatted prototype pollution vulnerability

86b028e

Run npm audit fix to update flatted to a non-vulnerable version. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

eedorenko marked this pull request as ready for review March 19, 2026 21:00

WilliamBerryiii requested changes Mar 19, 2026

eedorenko added 16 commits April 1, 2026 13:51

Merge branch 'main' into eedorenko/beval

41796da

Merge branch 'main' into eedorenko/beval

9c256a4

fix: add missing comma in allow-dependencies-licenses list

b7035d4

The missing comma after copilot-win32-x64 caused it to be concatenated with pkg:npm/hve-core into a single invalid entry, so the dependency review check rejected the copilot-win32-x64 license.
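
This failure mode is easy to reproduce. Assuming the allow list is a comma-separated string whose lines are joined with whitespace, a dropped comma fuses two entries into one that matches neither package. The sketch below shows the effect and a simple guard; the function names and package URLs are illustrative, not the repository's actual values.

```python
def parse_allow_list(raw: str) -> list[str]:
    """Split a comma-separated allow list into trimmed, non-empty entries."""
    return [entry.strip() for entry in raw.split(",") if entry.strip()]


def malformed_entries(entries: list[str]) -> list[str]:
    """Entries containing whitespace usually mean a comma was dropped between two items."""
    return [entry for entry in entries if len(entry.split()) > 1]


# Illustrative values: the exact package URLs in the repository's list are assumptions.
BROKEN = "pkg:npm/copilot-win32-x64 pkg:npm/hve-core, pkg:npm/example"
FIXED = "pkg:npm/copilot-win32-x64, pkg:npm/hve-core, pkg:npm/example"

print(malformed_entries(parse_allow_list(BROKEN)))
print(malformed_entries(parse_allow_list(FIXED)))
```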

Merge branch 'main' into eedorenko/beval

c5cb5e3

Merge branch 'main' into eedorenko/beval

292f7d6

Merge branch 'main' into eedorenko/beval

11fd4e7

Merge branch 'main' into eedorenko/beval

eedd6cf

Merge branch 'main' into eedorenko/beval

8a01cca

Merge branch 'main' into eedorenko/beval

033d010

Merge branch 'main' into eedorenko/beval

da126e6

Merge branch 'main' into eedorenko/beval

2e96685

Merge branch 'main' into eedorenko/beval

bb3a172

Merge branch 'main' into eedorenko/beval

ca91e7a

Merge branch 'main' into eedorenko/beval

d43c8dd

Merge branch 'main' into eedorenko/beval

75b41ce

Merge branch 'main' into eedorenko/beval

a39650a

chaosdinosaur reviewed Apr 22, 2026

Comment thread .github/workflows/beval.yml

Comment thread .github/workflows/beval.yml

Comment thread .cspell.json

Comment thread .cspell.json

Comment thread package-lock.json

WilliamBerryiii changed the title ~~ci: add beval behavioral evaluation workflow for dt-coach agent~~ feat(workflows): add beval behavioral evaluation workflow for dt-coach agent Apr 23, 2026

eedorenko and others added 2 commits April 23, 2026 13:36

eedorenko requested a review from chaosdinosaur April 24, 2026 00:15

eedorenko added 5 commits April 23, 2026 17:16

Merge branch 'main' into eedorenko/beval

36bcbd5

Merge branch 'main' into eedorenko/beval

ba171ef

Merge branch 'main' into eedorenko/beval

184e948

Merge branch 'main' into eedorenko/beval

e1dd72a

Merge branch 'main' into eedorenko/beval

787e28f

Merge branch 'main' into eedorenko/beval

c27b061

WilliamBerryiii approved these changes May 19, 2026

chaosdinosaur reviewed May 21, 2026

6 participants