
[aw-failures] Smoke CI hard-red at startup — EACCES mkdir /tmp/gh-aw/sandbox/firewall/logs, agent never invoked (rootless left [Content truncated due to length] #42398


@github-actions

Reclaim the rootless /tmp/gh-aw/sandbox tree before writeConfigs() — a leftover root-owned dir makes mkdir /tmp/gh-aw/sandbox/firewall/logs fail EACCES and kills Smoke CI at startup before the agent is ever invoked.

This is a NEW, untracked P1 hard-red. It is distinct from #41455 (firewall startup via DNS EAI_AGAIN), #41636 (Copilot CLI exit-1 after safe-outputs succeed), and #41885 (Claude parse step on empty logEntries). Those are DNS races or post-completion false-reds; this one fails pre-flight — the agent never runs, so the run is 100% lost.

Problem statement

Make the AWF sandbox bootstrap resilient to a pre-existing root-owned /tmp/gh-aw/sandbox left by a prior rootless container on the same runner. Today, the very first config-generation step dies:

[INFO] Network-isolation mode: enforcing egress via Docker network topology (no host iptables, no sudo).
[INFO] Generating configuration files...
[ERROR] Fatal error: Error: EACCES: permission denied, mkdir '/tmp/gh-aw/sandbox/firewall/logs'
    at Object.mkdirSync (node:fs:1370:26)
    ... at Object.RM [as writeConfigs] (/home/runner/.local/lib/awf/awf-bundle.js:786:1941)
[WARN] Could not fix squid log permissions: Error: Command failed ... chmod -R a+rX /tmp/gh-aw/sandbox/firewall/logs
chmod: cannot access '/tmp/gh-aw/sandbox/firewall/logs': Permission denied
Process exiting with code: 1
##[error]Process completed with exit code 1.

The recovery path (chmod -R a+rX) also fails Permission denied, so there is no escape hatch — the run aborts hard.

Affected workflows and run IDs

| Workflow | Engine | Failed run | Nearest success (same config) |
| --- | --- | --- | --- |
| Smoke CI (.github/workflows/smoke-ci.lock.yml) | copilot | §28413001230 (01:00:29Z) | §28413042897 (01:01:35Z, ~1 min later) |

A nearly-identical Smoke CI run succeeded ~1 minute later on the same config → this is a per-runner ownership race, not a configuration error.

Evidence

audit-diff: failed §28413001230 vs success §28413042897
{
  "firewall_diff": { "summary": { "has_anomalies": false, "anomaly_count": 0 } },
  "run_metrics_diff": {
    "run1_token_usage": 0, "run2_token_usage": 0,
    "github_rate_limit_details": { "run1_total_api_calls": 11, "run1_core_consumed": 65 }
  }
}
  • run1_token_usage: 0 — the agent was never invoked; the job died during "Generating configuration files".
  • has_anomalies: false — no firewall/egress divergence; the firewall config was never written because mkdir failed first. The discriminator is purely the pre-flight mkdir EACCES.

Probable root cause

  1. A prior rootless container on the same runner leaves /tmp/gh-aw/sandbox (the parent of firewall/logs) owned by a uid the current runner user cannot write. The same residue surfaced as "Rootless artifact permission repair failed (exit 1)" in #41636 and #41885.
  2. writeConfigs() calls mkdirSync('/tmp/gh-aw/sandbox/firewall/logs') without first ensuring the tree is owned by / writable for the current uid → EACCES.
  3. The chmod -R a+rX fallback cannot touch the root-owned dir, so the bootstrap fatally exits 1 instead of repairing or relocating.
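
To confirm this chain on an affected runner, it helps to see which existing directory actually blocks the mkdir. The sketch below is an illustration only, not the AWF code: `nearestExistingAncestor` and `diagnoseMkdirTarget` are hypothetical helper names, and the only calls used are standard Node.js `fs` and `path` APIs. It walks up from the target path to the first directory that exists, then reports that directory's owner uid and whether the current process can write into it.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: walk upward until a path that exists is found.
function nearestExistingAncestor(target) {
  let current = path.resolve(target);
  while (!fs.existsSync(current)) {
    const parent = path.dirname(current);
    if (parent === current) break; // reached the filesystem root
    current = parent;
  }
  return current;
}

// Hypothetical helper: report who owns the blocking directory and
// whether the current process could create entries inside it.
function diagnoseMkdirTarget(target) {
  const ancestor = nearestExistingAncestor(target);
  const stat = fs.statSync(ancestor);
  let writable = true;
  try {
    // Creating a child needs both write and search permission.
    fs.accessSync(ancestor, fs.constants.W_OK | fs.constants.X_OK);
  } catch {
    writable = false;
  }
  return {
    ancestor,
    ownerUid: stat.uid,
    // process.getuid is not available on every platform.
    currentUid: typeof process.getuid === "function" ? process.getuid() : null,
    writable,
  };
}

// Path taken from the failing log line in this issue.
console.log(diagnoseMkdirTarget("/tmp/gh-aw/sandbox/firewall/logs"));
```

If the hypothesis above holds, an affected runner would report an `ancestor` at or below /tmp/gh-aw/sandbox with `writable: false` and an `ownerUid` different from `currentUid`.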

This is the pre-flight twin of #41885's post-teardown rootless-ownership failure: same root cause (rootless leaves root-owned /tmp/gh-aw/sandbox), opposite end of the run lifecycle.

Proposed remediation

  1. Primary: before writeConfigs(), reclaim the sandbox tree for the current uid — rm -rf /tmp/gh-aw/sandbox (or rootless chown via podman unshare) when a pre-existing root-owned residue is detected — then mkdir. A fresh, uid-owned tree eliminates the race.
  2. Resilience: on mkdir EACCES under /tmp/gh-aw/sandbox, attempt ownership repair or fall back to a fresh uid-scoped temp dir instead of fatal-exiting.
  3. Root fix (shared with #41636/#41885): make post-run cleanup reliably remove root-owned sandbox residue. The recurring "Rootless artifact permission repair failed (exit 1)" warnings show that teardown is not reclaiming these dirs, so the next run inherits them.
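
Steps 1 and 2 could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the actual `writeConfigs()` implementation: `ensureSandboxDir` is a hypothetical name, the `gh-aw-sandbox-<uid>-` prefix for the fallback directory is invented for illustration, and it assumes every later consumer of the sandbox path can be pointed at the returned root. The `podman unshare` variant is not shown.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

const PERMISSION_ERRORS = new Set(["EACCES", "EPERM"]);

// Hypothetical helper: create <sandboxRoot>/<relativeDir>, reclaiming or
// relocating the tree instead of exiting on a permission error.
function ensureSandboxDir(sandboxRoot, relativeDir) {
  const target = path.join(sandboxRoot, relativeDir);

  // Normal case: the tree is absent or already owned by this uid.
  try {
    fs.mkdirSync(target, { recursive: true });
    return { dir: target, root: sandboxRoot, relocated: false };
  } catch (err) {
    if (!PERMISSION_ERRORS.has(err.code)) throw err;
  }

  // Step 1 (reclaim): remove the leftover tree and recreate it.
  // Any failure here means the residue could not be reclaimed.
  try {
    fs.rmSync(sandboxRoot, { recursive: true, force: true });
    fs.mkdirSync(target, { recursive: true });
    return { dir: target, root: sandboxRoot, relocated: false };
  } catch {
    // Fall through to relocation.
  }

  // Step 2 (resilience): fall back to a fresh, uid-scoped temp root.
  const uid = typeof process.getuid === "function" ? process.getuid() : "nouid";
  const freshRoot = fs.mkdtempSync(
    path.join(os.tmpdir(), `gh-aw-sandbox-${uid}-`) // assumed naming scheme
  );
  const relocatedDir = path.join(freshRoot, relativeDir);
  fs.mkdirSync(relocatedDir, { recursive: true });
  return { dir: relocatedDir, root: freshRoot, relocated: true };
}
```

A caller would invoke `ensureSandboxDir("/tmp/gh-aw/sandbox", "firewall/logs")` before generating configs and use the returned `root` for everything that follows. Relocation only helps if no other component hard-codes /tmp/gh-aw/sandbox, which this issue does not establish.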

Success criteria / verification

  • Smoke CI no longer aborts with EACCES: mkdir '/tmp/gh-aw/sandbox/firewall/logs'.
  • A runner seeded with a leftover root-owned /tmp/gh-aw/sandbox from a prior rootless run still starts the firewall and invokes the agent.
  • A teardown assertion confirms /tmp/gh-aw/sandbox is fully removed (no root-owned residue) after each run.
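
The teardown assertion in the last criterion might look like the sketch below. `assertSandboxRemoved` is a hypothetical name, and where it would be wired into the workflow's cleanup step is left open. It treats only a missing path as success and fails loudly if anything remains or if the path cannot be inspected.

```javascript
const fs = require("node:fs");

// Hypothetical post-run check: pass only if the sandbox root is gone.
function assertSandboxRemoved(sandboxRoot) {
  let stat;
  try {
    // lstat so that a dangling symlink still counts as residue.
    stat = fs.lstatSync(sandboxRoot);
  } catch (err) {
    if (err.code === "ENOENT") return; // fully removed
    // For example EACCES on a parent: removal cannot be confirmed.
    throw new Error(`Cannot verify teardown of ${sandboxRoot}: ${err.code}`);
  }
  const mode = (stat.mode & 0o777).toString(8);
  throw new Error(
    `Teardown left residue at ${sandboxRoot} (owner uid ${stat.uid}, mode ${mode})`
  );
}
```

Including the owner uid and mode in the error message makes it possible to tell from the log alone whether the residue is the root-owned kind described in this issue.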

Existing-issue correlation

Analyzed run IDs: 28413001230 (representative), comparator 28413042897.

References: §28413001230 · §28413042897

Generated by 🔍 [aw] Failure Investigator (6h) · 153.3 AIC · ⌖ 38 AIC · ⊞ 5.6K ·

  • expires on Jul 6, 2026, 5:40 PM UTC-08:00
