feat: add agents/ orchestration framework for autonomous bug fixing and feature building #1124
aidandaly24 wants to merge 12 commits into main
Conversation
Adds a self-contained Python project for autonomous agents powered by Bedrock AgentCore Harness. Includes:

- core/ — shared harness client (raw HTTP + SigV4), response parsing, config
- orchestrations/fix_and_review/ — multi-phase pipeline: plan → execute → verify → multi-round review → fix → PR
- bug_fixer/ — workflow entry point for fixing issues labeled 'bug'
- feature_builder/ — workflow entry point for building features from devex + impl docs
- pr_reviewer/ — migrated from .github/harness/ to share core infrastructure
- GitHub Actions workflows for both triggers
- 19 unit tests

Tested end-to-end: successfully planned, implemented, and reviewed fixes for issues #761 and #924 with Opus 4.7, creating PRs with proper templates.
```python
from core.config import PipelineConfig
from core.harness_client import HarnessClient
from core.parsing import Finding
from orchestrations.fix_and_review.phases.aggregate import run_aggregate
from orchestrations.fix_and_review.phases.complete import run_complete
from orchestrations.fix_and_review.phases.execute import run_execute
from orchestrations.fix_and_review.phases.extract import ExtractResult, run_extract

import os
import tempfile
```
```python
import pytest

from core.parsing import Finding, ReviewResult, parse_reviewer_output
```

```python
import pytest

from orchestrations.fix_and_review.partitioning import (
    DiffStats,
    ReviewerAssignment,
    calculate_reviewer_count,
    partition_round1_by_directory,
    partition_round2_focus_prompts,
    partition_round3_risk_areas,
)
```
Workflow sets HARNESS_ARN secret but code never reads it; hardcoded personal ARN will be used in CI instead
Both .github/workflows/bug-fixer.yml and .github/workflows/feature-builder.yml export HARNESS_ARN: ${{ secrets.HARNESS_ARN }} as an env var, but nothing in agents/ reads environment variables — PipelineConfig only loads from config.yaml, and the workflows don't pass --harness-arn.
The hardcoded value in agents/config.yaml is arn:aws:bedrock-agentcore:us-west-2:603141041947:harness/IssueSolver_aidandal-8SL97TEXjS — a personal developer harness (and a personal AWS account ID checked into the repo). When these workflows run in CI they will always hit that personal harness and ignore the secret.
Fix options:

1. Have `PipelineConfig.from_yaml` (or `PipelineConfig` itself) read `HARNESS_ARN` / `AWS_PROFILE` / etc. from env vars, with env taking precedence over YAML.
2. Pass `--harness-arn "$HARNESS_ARN"` on the `uv run python -m ...` line in each workflow.
3. Remove the account-specific ARN from the committed `config.yaml` (leave it as a placeholder/empty) and require env/flag override in non-local runs.
Option 1 is probably cleanest since it also solves the aws_profile issue below.
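A minimal sketch of option 1, assuming `PipelineConfig` is a dataclass populated from `config.yaml`; the exact field set and defaults here are assumptions, not the PR's actual code:

```python
import os
from dataclasses import dataclass

import yaml  # assumes PyYAML, which a config.yaml loader presumably already uses


@dataclass
class PipelineConfig:
    harness_arn: str = ""
    region: str = "us-west-2"
    aws_profile: str | None = None

    @classmethod
    def from_yaml(cls, path: str) -> "PipelineConfig":
        with open(path) as f:
            data = yaml.safe_load(f) or {}
        # Env vars win over the committed YAML, so CI can inject the real
        # harness ARN via secrets without a personal ARN living in the repo.
        return cls(
            harness_arn=os.environ.get("HARNESS_ARN", data.get("harness_arn", "")),
            region=os.environ.get("AWS_REGION", data.get("region", "us-west-2")),
            aws_profile=os.environ.get("AWS_PROFILE", data.get("aws_profile")),
        )
```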
```python
self.session = boto3.Session(
    region_name=config.region,
    profile_name=config.aws_profile,
)
```
aws_profile="deploy" default will crash under GitHub Actions OIDC
Both workflows use aws-actions/configure-aws-credentials with role-to-assume — this sets AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN env vars, not a named profile. But PipelineConfig defaults aws_profile to "deploy", and HarnessClient.__init__ passes it unconditionally:
```python
self.session = boto3.Session(region_name=config.region, profile_name=config.aws_profile)
```

In CI there is no `deploy` profile, so boto3 will raise `ProfileNotFound` the moment either agent starts.
Fix options:

1. Treat `aws_profile` as optional (e.g. `None` default) and only pass `profile_name=` when it's set, so env-var credentials flow through naturally.
2. Read `AWS_PROFILE` from env and default to `None` rather than `"deploy"`.
3. Pass an explicit `--aws-profile` override from the workflow (but there's no profile to point at in GH Actions, so this doesn't really work).
Option 1 or 2 is needed for the workflows to run at all.
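A sketch of option 1 as it would sit inside `HarnessClient.__init__`, assuming `aws_profile` now defaults to `None`:

```python
import boto3

# Only name a profile when one is explicitly configured; otherwise boto3
# falls back to its default credential chain (env vars, OIDC web identity,
# instance metadata, etc.), which is what GitHub Actions provides.
session_kwargs = {"region_name": config.region}
if config.aws_profile:
    session_kwargs["profile_name"] = config.aws_profile
self.session = boto3.Session(**session_kwargs)
```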
```toml
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.14"
```
Python version mismatch: requires-python = ">=3.14" vs workflow using Python 3.12
agents/pyproject.toml declares requires-python = ">=3.14", but both .github/workflows/bug-fixer.yml and .github/workflows/feature-builder.yml use actions/setup-python@v6 with python-version: '3.12'.
`uv sync` will refuse to use 3.12 for a project that requires 3.14. uv may auto-download a 3.14 interpreter, but this is fragile, and since 3.14 was only released in October 2025 this constraint is almost certainly unintended.
Fix options:

1. Drop the `requires-python` constraint to `>=3.12` (matches the workflow and the rest of your dependencies' support).
2. Bump the workflow to `python-version: '3.14'`.
Option 1 is safer unless you actually depend on a 3.14-only feature.
```
3. Run tests with summary: `npm run test:unit 2>&1 | grep -E "(FAIL|PASS|Tests:|Test Suites:)" | tail -20`
4. If tests fail, debug the specific file: `npm run test:unit -- path/to/failing.test.ts 2>&1 | tail -50`
5. Commit your changes: `git add -A && git commit -m "feat: {commit_message}"`
6. Push to remote: `git push origin feature/{feature_name}`
```
{feature_name} placeholder will cause a KeyError — feature_builder Phase 2 is broken
This template references {feature_name} on line 12, but run_execute in orchestrations/fix_and_review/phases/execute.py only passes plan, commit_message, and branch_name to load_prompt("executor.md", ...). str.format will raise KeyError: 'feature_name' the first time the feature_builder pipeline hits Phase 2: Execute.
The PR description says this was tested end-to-end on issues #761 and #924, but those are bug-fix cases — the feature_builder path hasn't actually exercised this prompt.
Fix options:

1. Change line 12 to use `{branch_name}` instead of `feature/{feature_name}` (matches `bug_fixer/prompts/executor.md` and is what `run_execute` already supplies).
2. Pass `feature_name=feature_name or ""` through from the orchestrator into `run_execute` and into `load_prompt`.
Option 1 is the minimal fix.
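If you'd rather harden template rendering itself (the direction the follow-up commits take with `format_map`), here is a sketch of the idea; the `_Defaults` helper and `render_prompt` name are hypothetical, not existing code in the PR:

```python
class _Defaults(dict):
    """Fallback mapping for str.format_map: unknown placeholders render empty."""

    def __missing__(self, key: str) -> str:
        return ""  # or f"{{{key}}}" to leave the placeholder visible for debugging


def render_prompt(template: str, **values: str) -> str:
    # {plan}, {commit_message}, {branch_name} render normally; a stray
    # {feature_name} no longer raises KeyError and crashes Phase 2.
    return template.format_map(_Defaults(values))
```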
```python
if exit_code == 0 and stdout.strip():
    pr_urls.append(stdout.strip())
else:
    errors.append(f"Failed to create PR in {repo}: {stderr[:500]}")
```
Duplicate PR-URL append + stale stderr/exit_code/stdout in the else branch
Lines 96–109 have two separate bugs from what looks like a bad merge / dead code that wasn't deleted:
```python
if url_match:
    pr_urls.append(url_match.group(0))
else:
    stdout, _, _ = client.run_command(
        session_id, f"cd {repo_name} && gh pr list --head {branch_name} ..."
    )
    if stdout.strip():
        pr_urls.append(stdout.strip())
    else:
        errors.append(f"PR may have been created in {repo} but could not extract URL")
if exit_code == 0 and stdout.strip():  # <-- stale vars from previous iterations
    pr_urls.append(stdout.strip())  # <-- double-append on success path
else:
    errors.append(f"Failed to create PR in {repo}: {stderr[:500]}")  # <-- stale stderr from push
```

- On the success path (`url_match` was found), `stdout` / `exit_code` still hold values from the previous loop iteration's `git push` at line 64. If that happens to be `exit_code == 0` with non-empty stdout, you append a bogus URL; otherwise you log "Failed to create PR" with `stderr` from a `git push`, even though the PR actually succeeded.
- On the fallback path (`gh pr list`), you also re-append `stdout` a second time via the trailing block — double URLs in the result.
Fix: delete lines 106–109 entirely. The `if url_match` / `else` block above already handles both branches.
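For reference, the surviving block once those stale trailing lines are deleted (same code as quoted above, including the elided `gh pr list` arguments):

```python
if url_match:
    pr_urls.append(url_match.group(0))
else:
    # Fallback: the PR may exist even though no URL could be parsed from stdout.
    stdout, _, _ = client.run_command(
        session_id, f"cd {repo_name} && gh pr list --head {branch_name} ..."
    )
    if stdout.strip():
        pr_urls.append(stdout.strip())
    else:
        errors.append(f"PR may have been created in {repo} but could not extract URL")
```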
```python
continue
test_cmd = TEST_COMMANDS.get(repo, "npm test")
print(f"  Running tests in {repo} (may take a few minutes)...", flush=True)
_, stderr, exit_code = client.run_command(session_id, f'cd {repo} && {test_cmd} 2>&1 | grep -E "(FAIL|PASS|Tests:|Test Suites:)" | tail -20')
```
Typecheck/test exit_code is from tail, not from the command — verification will almost always falsely pass
Both the typecheck and test commands pipe through tail (and in the test case, also grep):
```python
_, stderr, exit_code = client.run_command(session_id, f"cd {repo} && npm run typecheck 2>&1 | tail -5")
...
_, stderr, exit_code = client.run_command(session_id, f'cd {repo} && {test_cmd} 2>&1 | grep -E "(FAIL|PASS|...)" | tail -20')
```

In a POSIX shell without `set -o pipefail`, the pipeline's exit status is the exit status of the last command (`tail`), which is essentially always 0. That means `typecheck_passes` and `tests_pass` will be set to `True` even when `npm run typecheck` / `npm run test:unit` fail — defeating the whole point of Phase 2.5.
Also note `stderr` here is always empty because of `2>&1`, so the error message in `errors.append(f"...: {stderr[:500]}")` is useless even when the check does catch a failure.
Fix options:

1. Prefix each command with `set -o pipefail;` (bash only — not guaranteed in plain `sh`).
2. Run the command twice: once to capture the real exit code (`npm run typecheck > /tmp/tc.log 2>&1`), then `tail -5 /tmp/tc.log` to get the tail for display.
3. Write to a file and check `$?` explicitly: `npm run typecheck > /tmp/tc.log 2>&1; rc=$?; tail -5 /tmp/tc.log; exit $rc`.
Option 3 is the most shell-portable.
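A sketch of option 3 applied to the test step in `verify.py`; the `/tmp/agent_test.log` path is arbitrary, and `run_command` is assumed to return `(stdout, stderr, exit_code)` as in the snippets above:

```python
# Redirect full output to a file, remember the command's real exit code,
# then emit just the grep/tail summary; `exit $rc` makes the shell's status
# reflect the test run rather than tail's.
cmd = (
    f"cd {repo} && {test_cmd} > /tmp/agent_test.log 2>&1; rc=$?; "
    'grep -E "(FAIL|PASS|Tests:|Test Suites:)" /tmp/agent_test.log | tail -20; '
    "exit $rc"
)
stdout, _, exit_code = client.run_command(session_id, cmd)
tests_pass = exit_code == 0  # note: the summary now arrives on stdout, not stderr
```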
| if "agentcore-l3-cdk" in plan.lower() or "cdk" in plan.lower(): | ||
| affected_repos.append("agentcore-l3-cdk-constructs") | ||
| if not affected_repos: | ||
| affected_repos = ["agentcore-cli"] |
affected_repos detection is effectively always both repos
if "agentcore-cli" in plan.lower() or "cli" in plan.lower():
affected_repos.append("agentcore-cli")
if "agentcore-l3-cdk" in plan.lower() or "cdk" in plan.lower():
affected_repos.append("agentcore-l3-cdk-constructs")The substring checks "cli" in plan.lower() and "cdk" in plan.lower() will match virtually any plan the LLM produces — plans routinely mention "CLI", "CDK", "agentcore-cli", etc. even when only one repo is actually touched. So affected_repos is almost always ["agentcore-cli", "agentcore-l3-cdk-constructs"], which then causes run_verify / run_complete / run_extract to cd into both repos, run tests in both, try to push branches in both, etc.
Downstream phases partially compensate by checking git diff main --stat before running tests/push (in verify.py), but run_extract does not — it runs git diff main in whatever the current directory is and will miss changes in the other repo.
Fix options:

1. Stop guessing from the plan text. Instead, detect affected repos directly by asking the harness to run `cd <repo> && git log main..HEAD --oneline` in each known repo and keeping those with commits.
2. Ask the planner to emit a structured `affected_repos: [cli, cdk]` list (e.g. a fenced JSON block) and parse it, rather than substring-matching prose.
3. Keep a fixed list of both repos and let the per-repo `git diff` guards in `verify.py` filter, but then also fix `run_extract` to aggregate diffs from both repos.
Option 1 is the most robust since it measures ground truth.
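A sketch of option 1, assuming a known-repos constant (hypothetical name) and the `(stdout, stderr, exit_code)` return shape `run_command` shows elsewhere in this PR:

```python
KNOWN_REPOS = ["agentcore-cli", "agentcore-l3-cdk-constructs"]  # hypothetical constant

affected_repos = []
for repo in KNOWN_REPOS:
    # Ground truth: a repo is affected iff its working branch has commits past main.
    stdout, _, exit_code = client.run_command(
        session_id, f"cd {repo} && git log main..HEAD --oneline"
    )
    if exit_code == 0 and stdout.strip():
        affected_repos.append(repo)

if not affected_repos:
    affected_repos = ["agentcore-cli"]  # keep the existing fallback
```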
Critical:

- Remove stale variables in complete.py causing duplicate PR URLs

High:

- Add input validation in feature-builder.yml (path traversal, command injection)
- Resolve AWS credentials per-request instead of freezing at construction
- Use format_map with defaults to prevent KeyError on missing template vars
- Capture test exit code separately from grep display in verify.py
- Make JSON brace-depth counter string-aware in parsing.py
- Gitignore config.yaml (contains account-specific ARN), add config.yaml.example
- Guard against empty changed_files in partition_round1_by_directory

Medium:

- Add type coercion for numeric overrides in orchestrator
- Only push after all local checks pass in verify.py
- Skip push when rebase fails in complete.py
- Lower Python requirement to >=3.12
- Widen boto3/botocore version constraints
…let CI handle full suite
```python
test_files.append(changed)
else:
    # Look for adjacent test file
    test_candidate = changed.replace("/src/", "/src/").replace(".ts", ".test.ts")
```
…ce-with-lease for push
```python
)

# Phase 8: Complete
t0 = time.time()
```
…ix double serialization
```python
def invoke(
    self,
    session_id: str,
    message: str,
    system_prompt: str | None = None,
    max_iterations: int | None = None,
    verbose: bool = True,
    retries: int = 2,
) -> str:
```
…ration rabbit holes
…chestrator and agent
Description
Adds a self-contained Python project (`agents/`) for autonomous agents powered by Bedrock AgentCore Harness. This provides the shared orchestration infrastructure that the team's frontier week agents will build on.

What's included:

- `agents/core/` — shared harness client (raw HTTP + SigV4), response parsing, config
- `agents/orchestrations/fix_and_review/` — multi-phase pipeline: plan → execute → verify → multi-round review → fix → PR
- `agents/bug_fixer/` — workflow: issue labeled `bug` → agent plans fix → implements → reviews → PRs
- `agents/feature_builder/` — workflow: devex doc + impl plan → agent builds feature → reviews → PRs
- `agents/pr_reviewer/` — migrated from `.github/harness/` to share core infrastructure

Tested end-to-end: Successfully planned, implemented, and reviewed fixes for issues #761 and #924 with Opus 4.7, creating PRs with proper templates through 3 rounds of multi-agent review.
Architecture: See the proposal doc (linked in Quip) for full details on the layered design: workflows → orchestrations → phases → core.
Related Issue
Part of Frontier Week: CLI/SDK autonomous agents initiative.
Type of Change
Testing
- `npm run test:unit` and `npm run test:integ`
- `npm run typecheck`
- `npm run lint`

The `agents/` directory has its own test suite: `cd agents && uv sync && uv run pytest tests/ -v` (19 tests passing).

Checklist