[grafana-otel-advisor] OTel improvement: propagate gh-aw.engine.id to all setup spans #32297

@github-actions

OTel Instrumentation Improvement: propagate gh-aw.engine.id to all setup spans

Analysis Date: 2026-05-15
Priority: Medium
Effort: Small (< 2h)

Problem

The gh-aw.<job>.setup span emitted by actions/setup/js/send_otlp_span.cjs (sendJobSetupSpan) is missing the gh-aw.engine.id attribute on every job. The conclusion span carries gh-aw.engine.id because by then /tmp/gh-aw/aw_info.json has been written, but at setup time that file does not exist yet and no GH_AW_INFO_ENGINE_ID env var is set on the setup step.

A DevOps engineer cannot answer: “What is the p95 setup latency for Claude vs Codex vs Copilot workflows?” or “Which engine is most likely to fail before reaching the agent step?” — because every setup span is unlabelled with respect to the engine.

Why This Matters (DevOps Perspective)
  • Setup latency is workload-shaped (MCP install, firewall config, sandbox boot). Per-engine breakdowns are the first thing on-call wants when activation gets slow.
  • Setup failures (MCP/firewall/auth) often happen before any agent step writes aw_info.json, so the conclusion span’s engine attribute is the only carrier today — but conclusion spans are not always emitted on early aborts (e.g. timeouts in MCP install), leaving a class of failures with no engine.id anywhere in the trace.
  • Activation and safe-outputs jobs do not produce aw_info.json themselves; their setup AND conclusion spans both lack engine grouping today. Adding the attribute to the setup step env fixes both at once.
  • Unblocks Grafana / Honeycomb / Datadog panels grouped by gh-aw.engine.id for the setup phase, lowering MTTR on “my Claude workflow is slow today” incidents.

Current Behavior

send_otlp_span.cjs resolves the engine ID with this priority:

// actions/setup/js/send_otlp_span.cjs:176-178
function resolveEngineId(awInfo) {
  return readContextString(awInfo.engine_id)
      || readContextString(awInfo.context?.engine_id)
      || process.env.GH_AW_INFO_ENGINE_ID
      || "";
}

At the time the setup span is sent:

  1. /tmp/gh-aw/aw_info.json does not exist yet (it is written by generate_aw_info.cjs later in the agent step).
  2. process.env.GH_AW_INFO_ENGINE_ID is not set on the setup step, because the compiler only emits it on the Generate agentic run info step:
// pkg/workflow/compiler_yaml.go:797-801
yaml.WriteString("      - name: Generate agentic run info\n")
yaml.WriteString("        id: generate_aw_info\n")
yaml.WriteString("        env:\n")
fmt.Fprintf(yaml, "          GH_AW_INFO_ENGINE_ID: \"%s\"\n", engineID)

The setup step in compiler_yaml_step_generation.go:185-198 sets GH_AW_SETUP_WORKFLOW_NAME, GH_AW_CURRENT_WORKFLOW_REF, GH_AW_INFO_VERSION, and GH_AW_INFO_BODY_MODIFIED — but not GH_AW_INFO_ENGINE_ID.

Result: every gh-aw.<job>.setup span ships without gh-aw.engine.id.
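
To make the failure mode concrete, the sketch below replays the same precedence order in isolation. It is an illustration under stated assumptions, not the code in send_otlp_span.cjs: readContextString is replaced by a hypothetical stand-in (assumed to return a trimmed string or ""), the environment is passed in as a parameter instead of being read from process.env, and a missing aw_info.json is assumed to surface as an empty object.

```javascript
// Hypothetical stand-in for the real readContextString helper: assumed to
// return a trimmed string, or "" for anything that is not a string.
function readContextString(value) {
  return typeof value === "string" ? value.trim() : "";
}

// Same precedence as resolveEngineId above, with env injected for testability.
function resolveEngineIdSketch(awInfo, env) {
  return (
    readContextString(awInfo.engine_id) ||
    readContextString(awInfo.context?.engine_id) ||
    env.GH_AW_INFO_ENGINE_ID ||
    ""
  );
}

// Setup time today: no aw_info.json (assumed to load as {}), env var unset.
const atSetupToday = resolveEngineIdSketch({}, {});

// Setup time with the env var present on the setup step.
const atSetupProposed = resolveEngineIdSketch({}, { GH_AW_INFO_ENGINE_ID: "claude" });

// Conclusion time: aw_info.json exists, so the file wins over the env var.
const atConclusion = resolveEngineIdSketch({ engine_id: "claude" }, {});

console.log({ atSetupToday, atSetupProposed, atConclusion });
```

Under these assumptions the first call yields an empty string, which is why the attribute is dropped at setup time, while the other two resolve to "claude".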

Proposed Change

Propagate the resolved engine ID to the setup step env in pkg/workflow/compiler_yaml_step_generation.go, both in dev/release mode and in script mode. Since generateSetupStep does not currently receive the engine ID, pass it through or read it from data the same way generateCreateAwInfo does.

// pkg/workflow/compiler_yaml_step_generation.go (dev/release mode, ~line 185)
lines = append(lines,
    "        env:\n",
    fmt.Sprintf("          GH_AW_SETUP_WORKFLOW_NAME: %q\n", data.Name),
    fmt.Sprintf("          GH_AW_CURRENT_WORKFLOW_REF: %s\n", buildSetupWorkflowRefExpr(data)),
)
if engineID := resolveEngineIDForSetup(data); engineID != "" {
    lines = append(lines, fmt.Sprintf("          GH_AW_INFO_ENGINE_ID: %q\n", engineID))
}

Add the same line in the script-mode branch (~line 143). The helper mirrors what generateCreateAwInfo already does:

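// resolveEngineIDForSetup returns the engine ID to expose on the setup step,
// or "" when no engine is configured.
func resolveEngineIDForSetup(data *WorkflowData) string {
    if data == nil {
        return ""
    }
    if data.EngineConfig != nil && data.EngineConfig.ID != "" {
        return data.EngineConfig.ID
    }
    // Fall back to the AI field; this is "" when it is unset.
    return data.AI
}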

No change is required in send_otlp_span.cjs: resolveEngineId already reads process.env.GH_AW_INFO_ENGINE_ID as a fallback, but that branch is unreachable today because the env var is never set on the setup step.

Expected Outcome

After this change:

  • In Grafana / Honeycomb / Datadog: gh-aw.engine.id = "claude" matches setup spans, enabling p95 by gh-aw.engine.id on the setup phase and per-engine error rate panels that include pre-agent failures.
  • In the JSONL mirror: the very first span of every job (the setup span in /tmp/gh-aw/otel.jsonl) carries gh-aw.engine.id, matching what the conclusion span already provides.
  • For on-call engineers: when an MCP install or firewall configuration fails during setup (no aw_info.json ever written), the trace still identifies which engine the workflow targets — instead of an unattributable setup failure.

Implementation Steps
  • Add resolveEngineIDForSetup(data) helper in pkg/workflow/compiler_yaml_step_generation.go (mirroring the engine-ID resolution in generateCreateAwInfo).
  • Emit GH_AW_INFO_ENGINE_ID: %q in both branches of generateSetupStep (script mode around line 143, dev/release mode around line 185).
  • Add or extend a Go test in pkg/workflow/ (e.g. alongside compiler_jobs_test.go) asserting the lock file contains GH_AW_INFO_ENGINE_ID: inside the setup step block for each engine.
  • Add a JS test in actions/setup/js/send_otlp_span.test.cjs asserting that gh-aw.engine.id is present on the setup span when only GH_AW_INFO_ENGINE_ID env var is set (no aw_info.json). The fallback path already exists at send_otlp_span.cjs:177; the test makes the wiring durable.
  • Run make test-unit (or cd actions/setup/js && npx vitest run) and go test ./pkg/workflow/... to confirm.
  • Run make fmt.
  • Open a PR referencing this issue.
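
The lock-file assertion from the steps above is meant to live in a Go test; the sketch below only illustrates the shape of the check, written in JavaScript for consistency with the span code. Everything in it is an assumption for illustration: the function name, the sample step names, and the idea that a setup step can be recognised by the presence of GH_AW_SETUP_WORKFLOW_NAME in its block. The step splitting is a line-based heuristic, not a YAML parser.

```javascript
// Returns the first line of every setup-like step block that lacks
// GH_AW_INFO_ENGINE_ID. A step block starts at a list item whose first key is
// name, uses, id, or run (heuristic, not a real YAML parse).
function setupStepsMissingEngineId(lockYaml) {
  const blocks = [];
  for (const line of lockYaml.split("\n")) {
    if (/^\s*- (name|uses|id|run):/.test(line)) blocks.push([]);
    if (blocks.length > 0) blocks[blocks.length - 1].push(line);
  }
  return blocks
    .filter((b) => b.some((l) => l.includes("GH_AW_SETUP_WORKFLOW_NAME:")))
    .filter((b) => !b.some((l) => l.includes("GH_AW_INFO_ENGINE_ID:")))
    .map((b) => b[0].trim());
}

// Hypothetical lock-file fragment reflecting today's behaviour.
const sampleLock = [
  "    steps:",
  "      - name: Setup Scripts",
  "        uses: ./actions/setup",
  "        env:",
  '          GH_AW_SETUP_WORKFLOW_NAME: "Example"',
  "      - name: Generate agentic run info",
  "        id: generate_aw_info",
  "        env:",
  '          GH_AW_INFO_ENGINE_ID: "claude"',
].join("\n");

// The same fragment with the env var added to the setup step.
const patchedLock = sampleLock.replace(
  'GH_AW_SETUP_WORKFLOW_NAME: "Example"',
  'GH_AW_SETUP_WORKFLOW_NAME: "Example"\n          GH_AW_INFO_ENGINE_ID: "claude"'
);

const missingBefore = setupStepsMissingEngineId(sampleLock);
const missingAfter = setupStepsMissingEngineId(patchedLock);
console.log({ missingBefore, missingAfter });
```

A real test would compile a workflow per engine and run an equivalent check against the generated lock file rather than a hand-written fragment.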

Evidence from Live Grafana Data

Grafana Cloud Tempo (grafanacloud-mnkiefer-traces, datasource UID grafanacloud-traces) returned no traces over the past 7 days for any TraceQL query ({}, {resource.service.name="gh-aw"}). That suggests an export-side issue worth a separate investigation. The conclusion below is instead grounded in the live JSONL mirror that the same instrumentation writes locally, captured during this very run.

Sample setup span from /tmp/gh-aw/otel.jsonl written by this advisor run (workflow run §25902326163):

{
  "traceId": "c2c8ebecda224fa682e57bc5c3b5e0a7",
  "spanId": "410e8ff541f646af",
  "parentSpanId": "9a3cd09c6892fe77",
  "name": "gh-aw.agent.setup",
  "attributes": [
    { "key": "gh-aw.job.name",       "value": { "stringValue": "agent" } },
    { "key": "gh-aw.workflow.name",  "value": { "stringValue": "Daily Grafana OTel Instrumentation Advisor" } },
    { "key": "gh-aw.run.id",         "value": { "stringValue": "25902326163" } },
    { "key": "gh-aw.event_name",     "value": { "stringValue": "schedule" } },
    { "key": "gh-aw.staged",         "value": { "boolValue": false } },
    { "key": "gh-aw.episode.id",     "value": { "stringValue": "25902326163-1:..." } }
    /* NOTE: gh-aw.engine.id is MISSING */
  ]
}

Meanwhile /tmp/gh-aw/aw_info.json (written later in the same job) clearly resolves the engine:

{ "engine_id": "claude", "engine_name": "Claude Code", ... }

Resource attributes on the same span are fine (service.version, github.repository, github.run_id, github.event_name, deployment.environment all present) — the missing field is specifically gh-aw.engine.id at the span-attribute layer of the setup span.
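
The same check can be automated against the JSONL mirror. The sketch below is a minimal example, assuming each line of the file is a single span object in the shape shown in the sample above (the real file may use a different envelope). The function name and the conclusion span name in the sample data are hypothetical.

```javascript
// Returns the names of setup spans that carry no gh-aw.engine.id attribute.
// Assumes one JSON span object per line, with an "attributes" array of
// { key, value } pairs as in the sample span above.
function findSetupSpansMissingEngineId(jsonl) {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line))
    .filter((span) => typeof span.name === "string" && /^gh-aw\..+\.setup$/.test(span.name))
    .filter((span) => !(span.attributes || []).some((a) => a.key === "gh-aw.engine.id"))
    .map((span) => span.name);
}

// Illustrative two-line mirror: a setup span without the attribute and a
// (hypothetically named) conclusion span with it.
const sampleJsonl = [
  JSON.stringify({
    name: "gh-aw.agent.setup",
    attributes: [{ key: "gh-aw.job.name", value: { stringValue: "agent" } }],
  }),
  JSON.stringify({
    name: "gh-aw.agent.conclusion",
    attributes: [{ key: "gh-aw.engine.id", value: { stringValue: "claude" } }],
  }),
].join("\n");

const unlabelledSetupSpans = findSetupSpansMissingEngineId(sampleJsonl);
console.log(unlabelledSetupSpans);
```

Once the env var is set on the setup step, running a check like this over /tmp/gh-aw/otel.jsonl should return an empty list.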

Related Files
  • pkg/workflow/compiler_yaml_step_generation.go — emit GH_AW_INFO_ENGINE_ID env var on the setup step (both script and dev/release branches)
  • pkg/workflow/compiler_yaml.go:718-725 — engine-ID resolution to mirror
  • actions/setup/js/send_otlp_span.cjs:176-178 — existing fallback that consumes the env var (no change needed)
  • actions/setup/js/send_otlp_span.test.cjs — add coverage for the env-only setup path
  • pkg/workflow/compiler_jobs_test.go — add lock-file assertion that the setup step contains GH_AW_INFO_ENGINE_ID:

Generated by the Daily Grafana OTel Instrumentation Advisor workflow


  • expires on May 22, 2026, 5:47 AM UTC
