feat: add agents/ orchestration framework for autonomous bug fixing and feature building #1124
aidandaly24 wants to merge 12 commits into main
Conversation
Adds a self-contained Python project for autonomous agents powered by Bedrock AgentCore Harness. Includes:

- core/ — shared harness client (raw HTTP + SigV4), response parsing, config
- orchestrations/fix_and_review/ — multi-phase pipeline: plan → execute → verify → multi-round review → fix → PR
- bug_fixer/ — workflow entry point for fixing issues labeled 'bug'
- feature_builder/ — workflow entry point for building features from devex + impl docs
- pr_reviewer/ — migrated from .github/harness/ to share core infrastructure
- GitHub Actions workflows for both triggers
- 19 unit tests

Tested end-to-end: successfully planned, implemented, and reviewed fixes for issues #761 and #924 with Opus 4.7, creating PRs with proper templates.
```python
from core.config import PipelineConfig
from core.harness_client import HarnessClient
from core.parsing import Finding
from orchestrations.fix_and_review.phases.aggregate import run_aggregate
from orchestrations.fix_and_review.phases.complete import run_complete
from orchestrations.fix_and_review.phases.execute import run_execute
from orchestrations.fix_and_review.phases.extract import ExtractResult, run_extract

import os
import tempfile
```
```python
import pytest

from core.parsing import Finding, ReviewResult, parse_reviewer_output
```

```python
import pytest

from orchestrations.fix_and_review.partitioning import (
    DiffStats,
    ReviewerAssignment,
    calculate_reviewer_count,
    partition_round1_by_directory,
    partition_round2_focus_prompts,
    partition_round3_risk_areas,
)
```
Workflow sets HARNESS_ARN secret but code never reads it; hardcoded personal ARN will be used in CI instead
Both .github/workflows/bug-fixer.yml and .github/workflows/feature-builder.yml export HARNESS_ARN: ${{ secrets.HARNESS_ARN }} as an env var, but nothing in agents/ reads environment variables — PipelineConfig only loads from config.yaml, and the workflows don't pass --harness-arn.
The hardcoded value in agents/config.yaml is arn:aws:bedrock-agentcore:us-west-2:603141041947:harness/IssueSolver_aidandal-8SL97TEXjS — a personal developer harness (and a personal AWS account ID checked into the repo). When these workflows run in CI they will always hit that personal harness and ignore the secret.
Fix options:

1. Have `PipelineConfig.from_yaml` (or `PipelineConfig` itself) read `HARNESS_ARN` / `AWS_PROFILE` / etc. from env vars, with env taking precedence over YAML.
2. Pass `--harness-arn "$HARNESS_ARN"` on the `uv run python -m ...` line in each workflow.
3. Remove the account-specific ARN from the committed `config.yaml` (leave it as a placeholder/empty) and require env/flag override in non-local runs.
Option 1 is probably cleanest since it also solves the aws_profile issue below.
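A minimal sketch of option 1, assuming `PipelineConfig` is a dataclass populated from `config.yaml`; the exact field set and defaults here are assumptions, not the PR's actual code:

```python
import os
from dataclasses import dataclass

import yaml  # assumes PyYAML, which a config.yaml loader presumably already uses


@dataclass
class PipelineConfig:
    harness_arn: str = ""
    region: str = "us-west-2"
    aws_profile: str | None = None

    @classmethod
    def from_yaml(cls, path: str) -> "PipelineConfig":
        with open(path) as f:
            data = yaml.safe_load(f) or {}
        # Env vars win over the committed YAML, so CI can inject the real
        # harness ARN via secrets without a personal ARN living in the repo.
        return cls(
            harness_arn=os.environ.get("HARNESS_ARN", data.get("harness_arn", "")),
            region=os.environ.get("AWS_REGION", data.get("region", "us-west-2")),
            aws_profile=os.environ.get("AWS_PROFILE", data.get("aws_profile")),
        )
```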
```python
self.session = boto3.Session(
    region_name=config.region,
    profile_name=config.aws_profile,
)
```
aws_profile="deploy" default will crash under GitHub Actions OIDC
Both workflows use aws-actions/configure-aws-credentials with role-to-assume — this sets AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN env vars, not a named profile. But PipelineConfig defaults aws_profile to "deploy", and HarnessClient.__init__ passes it unconditionally:
```python
self.session = boto3.Session(region_name=config.region, profile_name=config.aws_profile)
```

In CI there is no `deploy` profile, so boto3 will raise `ProfileNotFound` the moment either agent starts.
Fix options:

1. Treat `aws_profile` as optional (e.g. `None` default) and only pass `profile_name=` when it's set, so env-var credentials flow through naturally.
2. Read `AWS_PROFILE` from env and default to `None` rather than `"deploy"`.
3. Pass an explicit `--aws-profile` override from the workflow (but there's no profile to point at in GH Actions, so this doesn't really work).
Option 1 or 2 is needed for the workflows to run at all.
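A sketch of option 1 as it would sit inside `HarnessClient.__init__`, assuming `aws_profile` now defaults to `None`:

```python
import boto3

# Only name a profile when one is explicitly configured; otherwise boto3
# falls back to its default credential chain (env vars, OIDC web identity,
# instance metadata, etc.), which is what GitHub Actions provides.
session_kwargs = {"region_name": config.region}
if config.aws_profile:
    session_kwargs["profile_name"] = config.aws_profile
self.session = boto3.Session(**session_kwargs)
```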
```toml
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.14"
```
Python version mismatch: requires-python = ">=3.14" vs workflow using Python 3.12
agents/pyproject.toml declares requires-python = ">=3.14", but both .github/workflows/bug-fixer.yml and .github/workflows/feature-builder.yml use actions/setup-python@v6 with python-version: '3.12'.
`uv sync` will refuse to use 3.12 for a project that requires 3.14. uv may auto-download a 3.14 interpreter, but this is fragile, and since 3.14 was only released in October 2025 this constraint is almost certainly unintended.
Fix options:

1. Drop the `requires-python` constraint to `>=3.12` (matches the workflow and the rest of your dependencies' support).
2. Bump the workflow to `python-version: '3.14'`.
Option 1 is safer unless you actually depend on a 3.14-only feature.
```
3. Run tests with summary: `npm run test:unit 2>&1 | grep -E "(FAIL|PASS|Tests:|Test Suites:)" | tail -20`
4. If tests fail, debug the specific file: `npm run test:unit -- path/to/failing.test.ts 2>&1 | tail -50`
5. Commit your changes: `git add -A && git commit -m "feat: {commit_message}"`
6. Push to remote: `git push origin feature/{feature_name}`
```
{feature_name} placeholder will cause a KeyError — feature_builder Phase 2 is broken
This template references {feature_name} on line 12, but run_execute in orchestrations/fix_and_review/phases/execute.py only passes plan, commit_message, and branch_name to load_prompt("executor.md", ...). str.format will raise KeyError: 'feature_name' the first time the feature_builder pipeline hits Phase 2: Execute.
The PR description says this was tested end-to-end on issues #761 and #924, but those are bug-fix cases — the feature_builder path hasn't actually exercised this prompt.
Fix options:

1. Change line 12 to use `{branch_name}` instead of `feature/{feature_name}` (matches `bug_fixer/prompts/executor.md` and is what `run_execute` already supplies).
2. Pass `feature_name=feature_name or ""` through from the orchestrator into `run_execute` and into `load_prompt`.
Option 1 is the minimal fix.
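If you'd rather harden template rendering itself (the direction the follow-up commits take with `format_map`), here is a sketch of the idea; the `_Defaults` helper and `render_prompt` name are hypothetical, not existing code in the PR:

```python
class _Defaults(dict):
    """Fallback mapping for str.format_map: unknown placeholders render empty."""

    def __missing__(self, key: str) -> str:
        return ""  # or f"{{{key}}}" to leave the placeholder visible for debugging


def render_prompt(template: str, **values: str) -> str:
    # {plan}, {commit_message}, {branch_name} render normally; a stray
    # {feature_name} no longer raises KeyError and crashes Phase 2.
    return template.format_map(_Defaults(values))
```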
```python
if exit_code == 0 and stdout.strip():
    pr_urls.append(stdout.strip())
else:
    errors.append(f"Failed to create PR in {repo}: {stderr[:500]}")
```
Duplicate PR-URL append + stale stderr/exit_code/stdout in the else branch
Lines 96–109 have two separate bugs from what looks like a bad merge / dead code that wasn't deleted:
```python
if url_match:
    pr_urls.append(url_match.group(0))
else:
    stdout, _, _ = client.run_command(
        session_id, f"cd {repo_name} && gh pr list --head {branch_name} ..."
    )
    if stdout.strip():
        pr_urls.append(stdout.strip())
    else:
        errors.append(f"PR may have been created in {repo} but could not extract URL")
if exit_code == 0 and stdout.strip():  # <-- stale vars from previous iterations
    pr_urls.append(stdout.strip())  # <-- double-append on success path
else:
    errors.append(f"Failed to create PR in {repo}: {stderr[:500]}")  # <-- stale stderr from push
```

- On the success path (`url_match` was found), `stdout` / `exit_code` still hold values from the previous loop iteration's `git push` at line 64. If that happens to be `exit_code == 0` with non-empty stdout, you append a bogus URL; otherwise you log "Failed to create PR" with `stderr` from a `git push`, even though the PR actually succeeded.
- On the fallback path (`gh pr list`), you also re-append `stdout` a second time via the trailing block — double URLs in the result.
Fix: delete lines 106–109 entirely. The `if url_match` / `else` block above already handles both branches.
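For reference, the surviving block once those stale trailing lines are deleted (same code as quoted above, including the elided `gh pr list` arguments):

```python
if url_match:
    pr_urls.append(url_match.group(0))
else:
    # Fallback: the PR may exist even though no URL could be parsed from stdout.
    stdout, _, _ = client.run_command(
        session_id, f"cd {repo_name} && gh pr list --head {branch_name} ..."
    )
    if stdout.strip():
        pr_urls.append(stdout.strip())
    else:
        errors.append(f"PR may have been created in {repo} but could not extract URL")
```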
```python
continue
test_cmd = TEST_COMMANDS.get(repo, "npm test")
print(f"  Running tests in {repo} (may take a few minutes)...", flush=True)
_, stderr, exit_code = client.run_command(session_id, f'cd {repo} && {test_cmd} 2>&1 | grep -E "(FAIL|PASS|Tests:|Test Suites:)" | tail -20')
```
Typecheck/test exit_code is from tail, not from the command — verification will almost always falsely pass
Both the typecheck and test commands pipe through tail (and in the test case, also grep):
```python
_, stderr, exit_code = client.run_command(session_id, f"cd {repo} && npm run typecheck 2>&1 | tail -5")
...
_, stderr, exit_code = client.run_command(session_id, f'cd {repo} && {test_cmd} 2>&1 | grep -E "(FAIL|PASS|...)" | tail -20')
```

In a POSIX shell without `set -o pipefail`, the pipeline's exit status is the exit status of the last command (`tail`), which is essentially always 0. That means `typecheck_passes` and `tests_pass` will be set to `True` even when `npm run typecheck` / `npm run test:unit` fail — defeating the whole point of Phase 2.5.
Also note `stderr` here is always empty because of `2>&1`, so the error message in `errors.append(f"...: {stderr[:500]}")` is useless even when the check does catch a failure.
Fix options:

1. Prefix each command with `set -o pipefail;` (bash only — not guaranteed in plain `sh`).
2. Run the command twice: once to capture the real exit code (`npm run typecheck > /tmp/tc.log 2>&1`), then `tail -5 /tmp/tc.log` to get the tail for display.
3. Write to a file and check `$?` explicitly: `npm run typecheck > /tmp/tc.log 2>&1; rc=$?; tail -5 /tmp/tc.log; exit $rc`.
Option 3 is the most shell-portable.
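A sketch of option 3 applied to the test step in `verify.py`; the `/tmp/agent_test.log` path is arbitrary, and `run_command` is assumed to return `(stdout, stderr, exit_code)` as in the snippets above:

```python
# Redirect full output to a file, remember the command's real exit code,
# then emit just the grep/tail summary; `exit $rc` makes the shell's status
# reflect the test run rather than tail's.
cmd = (
    f"cd {repo} && {test_cmd} > /tmp/agent_test.log 2>&1; rc=$?; "
    'grep -E "(FAIL|PASS|Tests:|Test Suites:)" /tmp/agent_test.log | tail -20; '
    "exit $rc"
)
stdout, _, exit_code = client.run_command(session_id, cmd)
tests_pass = exit_code == 0  # note: the summary now arrives on stdout, not stderr
```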
| if "agentcore-l3-cdk" in plan.lower() or "cdk" in plan.lower(): | ||
| affected_repos.append("agentcore-l3-cdk-constructs") | ||
| if not affected_repos: | ||
| affected_repos = ["agentcore-cli"] |
affected_repos detection is effectively always both repos
if "agentcore-cli" in plan.lower() or "cli" in plan.lower():
affected_repos.append("agentcore-cli")
if "agentcore-l3-cdk" in plan.lower() or "cdk" in plan.lower():
affected_repos.append("agentcore-l3-cdk-constructs")The substring checks "cli" in plan.lower() and "cdk" in plan.lower() will match virtually any plan the LLM produces — plans routinely mention "CLI", "CDK", "agentcore-cli", etc. even when only one repo is actually touched. So affected_repos is almost always ["agentcore-cli", "agentcore-l3-cdk-constructs"], which then causes run_verify / run_complete / run_extract to cd into both repos, run tests in both, try to push branches in both, etc.
Downstream phases partially compensate by checking git diff main --stat before running tests/push (in verify.py), but run_extract does not — it runs git diff main in whatever the current directory is and will miss changes in the other repo.
Fix options:

1. Stop guessing from the plan text. Instead, detect affected repos directly by asking the harness to run `cd <repo> && git log main..HEAD --oneline` in each known repo and keeping those with commits.
2. Ask the planner to emit a structured `affected_repos: [cli, cdk]` list (e.g. a fenced JSON block) and parse it, rather than substring-matching prose.
3. Keep a fixed list of both repos and let the per-repo `git diff` guards in `verify.py` filter, but then also fix `run_extract` to aggregate diffs from both repos.
Option 1 is the most robust since it measures ground truth.
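A sketch of option 1, assuming a known-repos constant (hypothetical name) and the `(stdout, stderr, exit_code)` return shape `run_command` shows elsewhere in this PR:

```python
KNOWN_REPOS = ["agentcore-cli", "agentcore-l3-cdk-constructs"]  # hypothetical constant

affected_repos = []
for repo in KNOWN_REPOS:
    # Ground truth: a repo is affected iff its working branch has commits past main.
    stdout, _, exit_code = client.run_command(
        session_id, f"cd {repo} && git log main..HEAD --oneline"
    )
    if exit_code == 0 and stdout.strip():
        affected_repos.append(repo)

if not affected_repos:
    affected_repos = ["agentcore-cli"]  # keep the existing fallback
```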
Critical:

- Remove stale variables in complete.py causing duplicate PR URLs

High:

- Add input validation in feature-builder.yml (path traversal, command injection)
- Resolve AWS credentials per-request instead of freezing at construction
- Use format_map with defaults to prevent KeyError on missing template vars
- Capture test exit code separately from grep display in verify.py
- Make JSON brace-depth counter string-aware in parsing.py
- Gitignore config.yaml (contains account-specific ARN), add config.yaml.example
- Guard against empty changed_files in partition_round1_by_directory

Medium:

- Add type coercion for numeric overrides in orchestrator
- Only push after all local checks pass in verify.py
- Skip push when rebase fails in complete.py
- Lower Python requirement to >=3.12
- Widen boto3/botocore version constraints
…let CI handle full suite
```python
test_files.append(changed)
else:
    # Look for adjacent test file
    test_candidate = changed.replace("/src/", "/src/").replace(".ts", ".test.ts")
```
…ce-with-lease for push
```python
)

# Phase 8: Complete
t0 = time.time()
```
…ix double serialization
```python
def invoke(
    self,
    session_id: str,
    message: str,
    system_prompt: str | None = None,
    max_iterations: int | None = None,
    verbose: bool = True,
    retries: int = 2,
) -> str:
```
…ration rabbit holes
…chestrator and agent
Description
Adds a self-contained Python project (`agents/`) for autonomous agents powered by Bedrock AgentCore Harness. This provides the shared orchestration infrastructure that the team's frontier week agents will build on.

What's included:

- `agents/core/` — shared harness client (raw HTTP + SigV4), response parsing, config
- `agents/orchestrations/fix_and_review/` — multi-phase pipeline: plan → execute → verify → multi-round review → fix → PR
- `agents/bug_fixer/` — workflow: issue labeled `bug` → agent plans fix → implements → reviews → PRs
- `agents/feature_builder/` — workflow: devex doc + impl plan → agent builds feature → reviews → PRs
- `agents/pr_reviewer/` — migrated from `.github/harness/` to share core infrastructure

Tested end-to-end: Successfully planned, implemented, and reviewed fixes for issues #761 and #924 with Opus 4.7, creating PRs with proper templates through 3 rounds of multi-agent review.
Architecture: See the proposal doc (linked in Quip) for full details on the layered design: workflows → orchestrations → phases → core.
Related Issue
Part of Frontier Week: CLI/SDK autonomous agents initiative.
Type of Change
Testing
- `npm run test:unit` and `npm run test:integ`
- `npm run typecheck`
- `npm run lint`

The `agents/` directory has its own test suite: `cd agents && uv sync && uv run pytest tests/ -v` (19 tests passing).

Checklist