Skip to content

codev: per-task complexity / per-protocol effort with cross-agent translation (complexity/* labels + harness flag emission) #998

@amrmelsayed

Description

@amrmelsayed

Problem

afx spawn currently launches the builder with whatever model and effort the local agent CLI defaults to. There's no way for codev to express that a SPIR cycle needs deeper reasoning than an AIR cycle, or that this specific issue warrants more thinking than the protocol default. The HarnessProvider abstraction at packages/codev/src/agent-farm/utils/harness.ts injects only the role / system prompt; effort selection is invisible to codev.

This issue introduces a stable codev-side complexity dimension that translates to each agent's native effort flag (Claude's --effort, Codex's reasoning effort, Gemini's thinking budget) without binding codev's vocabulary to any one tool.

Three-layer design

Layer Vocabulary Purpose
GitHub label (issue metadata) complexity/<level> Describes a property of the task. Pairs with area/* (same metadata-on-issue convention).
Codev internal (config, CLI flag, harness interface) complexity The stable enum codev passes around. CLI flag --complexity, config section complexity:, harness API parameter complexity.
Per-agent CLI flag (each harness translates) each tool's native term claude --effort high, codex --reasoning-effort high, gemini --thinking ... (exact flags subject to verification at plan-gate).

The translation happens inside each HarnessProvider. Codev's enum stays stable; each provider speaks its own native dialect when invoking the underlying CLI.

Why complexity/* over effort/* for the label

Two reasons:

  1. Labels describe issues, not actions. complexity/high is a fact about the work. effort/high reads as an instruction to the framework. The first matches the area/* precedent (metadata about the issue); the second blurs the line between metadata and directive.
  2. Framework-neutral. No agent calls the abstract concept "effort" by universal agreement. Claude uses "effort", Codex uses "reasoning effort", Gemini uses "thinking". "Complexity" describes the task itself, which is constant regardless of which agent processes it.

Inside codev's code (config keys, CLI flags, harness parameter names), using complexity consistently matches the label and keeps the layered vocabulary clean.

Resolution chain

When a builder spawns, codev picks the complexity in this order (first match wins):

1. CLI flag                  afx spawn 42 --complexity max
2. GitHub label              complexity/<level> on the issue
3. .codev/config.json        complexity.<protocol> override
4. PROTOCOL_DEFAULTS         built-in per-protocol default

Per-protocol built-in defaults (recommended starting point):

const PROTOCOL_DEFAULTS: Record<string, Complexity> = {
  spir:       'high',
  aspir:      'high',
  pir:        'high',
  research:   'high',
  bugfix:     'medium',
  maintain:   'medium',
  experiment: 'medium',
  air:        'low',
};

Reasoning:

  • SPIR / ASPIR / PIR involve real design calls (plan and review phases) and benefit from depth.
  • RESEARCH is multi-agent investigation by nature.
  • BUGFIX / MAINTAIN / EXPERIMENT are typically more constrained but can have subtle diagnostic work.
  • AIR is by definition well-spec'd and mechanical; deeper reasoning is wasted budget.

Level granularity

Claude exposes five levels: low, medium, high, xhigh, max. The codev enum should match for full coverage:

export type Complexity = 'low' | 'medium' | 'high' | 'xhigh' | 'max';

But for user-facing labels, expose only three (complexity/low, complexity/medium, complexity/high). Reserve xhigh and max for the CLI flag and config override, where they're explicit power-user choices. This keeps the common case discoverable while leaving room for the rare "this design call really wants the deepest reasoning" override.

For agents whose native scale is narrower (suppose Codex offers only low / medium / high), the harness clamps: xhigh and max both map to that agent's highest available setting.

Implementation surface

Approximately 100-150 LOC across these files:

  • New: packages/codev/src/agent-farm/utils/complexity.ts (the enum + per-protocol policy + label-parser)
  • Modified: packages/codev/src/agent-farm/utils/harness.ts (extend HarnessProvider interface with optional complexity parameter; update CLAUDE_HARNESS, CODEX_HARNESS, GEMINI_HARNESS, OPENCODE_HARNESS to translate)
  • Modified: afx spawn CLI command (new --complexity flag)
  • Modified: .codev/config.json schema docs (new complexity section)
  • Modified: spawn pipeline call sites that invoke harness.buildRoleInjection() and harness.buildScriptRoleInjection() to pass through the resolved complexity
  • New tests: complexity-policy unit tests; per-harness translation tests; label-parser tests

Design calls for plan-approval

Real decisions worth pinning at the plan-gate rather than during implementation:

  1. Exact CLI flag and effort-equivalent for each provider. Claude's --effort is verified; Codex's and Gemini's effort-equivalent flags need to be confirmed against their current CLI surfaces. Plan should run claude --help / codex --help / gemini --help and document the exact translation table.
  2. Level mapping when agent scale is narrower than codev's. Recommended: xhigh and max both clamp to the agent's highest level. Plan should confirm this is the right policy versus erroring out (which would force the user to pick a level the agent supports).
  3. What happens for an issue with no complexity/* label? Recommended: fall through to the protocol default silently. No warning, no implicit complexity/medium auto-applied. Plan should confirm this is the desired UX rather than nagging the user to label.
  4. Should the label appear in the row prefix anywhere? The Builders tree's row prefix from vscode: Builders tree should group by phase (action axis), not area (domain axis); area becomes the row prefix #952 currently shows [<area>]. Adding [<complexity>] would crowd it. Recommended: complexity does NOT appear in any tree row prefix; the area prefix is the at-a-glance triage, and complexity matters at spawn time, not at scan time. Plan should confirm.
  5. Architect default. Should the architect terminal (long-running, design-heavy by nature) always launch at high regardless of any per-protocol logic? Recommended yes; the architect's role is more design-heavy on average than any single builder's work. Plan should confirm.
  6. Cost guardrails. Once teams can spawn at xhigh / max freely, the bills could surprise. Should afx spawn warn before spawning at max, or after N high-effort spawns in a day? Lightest version: no guardrail in v1; can add later. Heaviest: an opt-in soft cap in .codev/config.json. Plan should pick the v1 floor.
  7. Backward-compat default. Existing users who don't add the complexity section to their config should see no behavior change. Recommended: codev defaults to a no-op (don't emit any effort flag) until the user explicitly opts in via the config section or a label. This ensures the change is additive and reversible.

Acceptance

  • afx spawn 42 --protocol pir --complexity high passes --effort high to Claude (or the appropriate translation for Codex / Gemini / OpenCode).
  • An issue labeled complexity/high spawns at high complexity even without the CLI flag.
  • A repo with no .codev/config.json complexity section and no per-issue label uses the protocol default.
  • Each built-in harness (claude, codex, gemini, opencode) emits the correct provider-specific flag for each codev complexity level, verified by harness unit tests.
  • When a level isn't directly supported by an agent's CLI, the harness clamps to the nearest available level (does not crash, does not pass through unrecognised flags).
  • No regression for users who don't configure any complexity policy (existing spawns continue to work without effort-related flags being emitted, per the backward-compat default).

Out of scope

  • Model selection (Opus vs Sonnet vs Haiku). That's a separate axis from complexity. Models stay in shell.builder / shell.architect config as today. A high-complexity spawn on Sonnet is meaningful; mixing the two axes into one knob would be wrong.
  • Per-phase complexity adjustment. Effort locks at process spawn; switching mid-session isn't possible with the current Claude Code CLI. Complexity is calibrated once to the protocol's hardest phase.
  • A complexity:critical or similar fifth label level. Stick with three labels (low/medium/high) for discoverability. xhigh / max are reachable via the CLI flag for explicit power-user choices.

Related

  • packages/codev/src/agent-farm/utils/harness.ts — where the harness providers live (the load-bearing extension point).
  • area/* label discipline — the precedent this issue follows for the complexity/* label family.
  • Memory feedback_respect_harness_abstraction.md — the principle of routing builder-specific behavior through HarnessProvider, which this issue extends rather than bypasses.

Metadata

Metadata

Assignees

No one assigned

    Labels

    area/towerArea: Tower server / agent farm CLI

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions