feat(native): native macOS runner mode for trusted repos #91

Open
ephpm-claude[bot] wants to merge 7 commits into main from worktree-feat-native-macos-runner

ephpm-claude (bot) commented Jun 11, 2026

Summary

  • Native macOS runner mode: run GHA jobs directly on the host for trusted repos — 4+ concurrent jobs (vs Apple's 2-VM cap), zero boot overhead, ~200MB/job. Configured per-repo under [runner.macos] with org/repo keys and org/* wildcards. Separate nativeMacSem gate; VM path untouched.
  • Never runs as root: jobs execute as a hidden _ephemerd service user (created lazily, like _www). Per-job ephemeral users were attempted and abandoned: macOS user deletion requires Full Disk Access and wedges opendirectoryd. Setting user = "..." in the config overrides the default service user.
  • Per-job isolation: own HOME/TMPDIR/work dir, keychain, Homebrew prefix, and a sandbox-exec profile (deny localhost outbound + all port binding; CIDR rules are unsupported by sandbox-exec — pf firewall is a follow-up).
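
The per-repo lookup described above could look roughly like the sketch below. It is not the PR's `ModeForRepo`: the function name `modeForRepo`, the mode strings `"native"` and `"vm"`, the precedence of exact keys over `org/*` wildcards, and the `"vm"` fallback are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// modeForRepo is a hypothetical stand-in for the PR's ModeForRepo.
// Assumption: an exact "org/repo" key wins over an "org/*" wildcard,
// and a repo with no matching key falls back to the VM path.
func modeForRepo(modes map[string]string, repo string) string {
	if m, ok := modes[repo]; ok {
		return m
	}
	if org, _, ok := strings.Cut(repo, "/"); ok {
		if m, ok := modes[org+"/*"]; ok {
			return m
		}
	}
	return "vm"
}

func main() {
	// Hypothetical config contents; the real keys live under [runner.macos].
	modes := map[string]string{
		"acme/*":       "native",
		"acme/website": "vm",
	}
	fmt.Println(modeForRepo(modes, "acme/tools"))   // wildcard match
	fmt.Println(modeForRepo(modes, "acme/website")) // exact key wins
	fmt.Println(modeForRepo(modes, "other/repo"))   // no match, VM path
}
```

Defaulting unmatched repos to the VM path keeps native execution opt-in, which fits the "trusted repos only" framing of this PR.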

Bug fixes found along the way

  • Runner extraction is OS-suffixed (runners/<ver>-<goos>): macOS host and Linux VM were corrupting each other's runner on the shared data dir, causing Linux dispatch exit 127.
  • isOfficialRunnerImage prefixes had a trailing dash that never matched the runner-ci-linux tag — custom-image Linux dispatch always exited 127.
  • DEVELOPER_DIR is now resolved via xcode-select -p; the previously hardcoded Xcode.app path broke git on CLT-only hosts.

Test plan

  • Unit tests: config (ModeForRepo, wildcards, ResolvedMaxNative), scheduler (routing, semaphore, canHandleJob), native (sandbox profile, workspace, copy)
  • Live: smoke-test jobs ran end-to-end on this Mac mini as _ephemerd, with all steps green including checkout
  • Live: 4 concurrent native runners verified
  • php-sdk / ephpm VM-path jobs unaffected; Linux arm64 dispatch fixed and verified green

🤖 Generated with Claude Code

Luther Monson and others added 7 commits May 31, 2026 18:16
Run GHA jobs directly on the macOS host instead of per-job VMs,
enabling 4+ concurrent jobs (vs Apple's 2-VM cap) with zero boot
overhead. Configured per-repo under [runner.macos] with "org/repo"
keys, "org/*" wildcards, and a separate nativeMacSem concurrency gate.
The VM path is untouched.
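
A separate concurrency gate of this kind can be sketched with a buffered channel used as a counting semaphore. The PR does not show how `nativeMacSem` is implemented, so the `sem` type, its methods, and the limits of 4 and 2 below are assumptions chosen to mirror the numbers in the description.

```go
package main

import "fmt"

// sem is a counting semaphore backed by a buffered channel.
// Hypothetical type: the real nativeMacSem may be built differently.
type sem chan struct{}

func newSem(n int) sem { return make(sem, n) }

// tryAcquire takes a slot without blocking and reports whether it got one.
func (s sem) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot that was previously acquired.
func (s sem) release() { <-s }

func main() {
	nativeSem := newSem(4) // assumed native limit
	vmSem := newSem(2)     // mirrors the 2-VM cap

	for i := 1; i <= 5; i++ {
		fmt.Printf("native job %d admitted: %v\n", i, nativeSem.tryAcquire())
	}
	// The VM gate is independent, so native load does not consume VM slots.
	fmt.Println("vm job admitted:", vmSem.tryAcquire())

	nativeSem.release()
	fmt.Println("native job admitted after release:", nativeSem.tryAcquire())
}
```

Keeping two independent gates means a burst of native jobs cannot starve the VM path, and vice versa.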

Jobs never run as root: a hidden _ephemerd service user is created
lazily (per-job ephemeral users were abandoned — macOS user deletion
requires Full Disk Access and wedges opendirectoryd). Each job gets
its own HOME/TMPDIR/work dir, keychain, Homebrew prefix, and a
sandbox-exec profile denying localhost outbound and port binding.

Also fixes uncovered along the way:
- runner extraction is OS-suffixed (runners/<ver>-<goos>) so the
  macOS host and Linux VM no longer corrupt each other's runner on
  the shared data dir (Linux dispatch exit 127)
- isOfficialRunnerImage prefixes had a trailing dash that never
  matched the runner-ci-linux tag, breaking custom-image dispatch
- DEVELOPER_DIR resolved via xcode-select -p instead of hardcoded
  Xcode.app path (broke git on CLT-only hosts)
- macOS VM runner monitor logs pgrep results at debug level
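
The trailing-dash bug is easy to reproduce in isolation. The sketch below is not the PR's `isOfficialRunnerImage`; the helper `hasAnyPrefix` and the prefix strings are assumptions, and only the `runner-ci-linux` tag name comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// hasAnyPrefix reports whether tag starts with any of the given prefixes.
// Hypothetical helper standing in for the check inside isOfficialRunnerImage.
func hasAnyPrefix(tag string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(tag, p) {
			return true
		}
	}
	return false
}

func main() {
	tag := "runner-ci-linux"

	// A prefix with a trailing dash is longer than the tag, so it never matches.
	fmt.Println(hasAnyPrefix(tag, []string{"runner-ci-linux-"})) // false

	// Without the trailing dash the same tag matches.
	fmt.Println(hasAnyPrefix(tag, []string{"runner-ci-linux"})) // true
}
```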

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Security follow-ups from review of the native runner. Native jobs run
directly on the host with no VM boundary, so the sandbox profile and
unix permissions are the entire isolation story — two concrete holes
closed here, plus one documented as needing live-macOS work.

1. Sibling-job + daemon-state isolation. Every native job runs as the
   same _ephemerd uid and all workspaces live under <dataDir>/native/,
   so a job could read a concurrent job's checkout token or source.
   The profile now denies read AND write of the whole <dataDir>/native
   subtree and re-allows only the job's own dir (sandbox-exec applies
   the last matching rule). config.toml, ephemerd.sock, and the vm dir
   gain write denies to match their existing read denies.

2. .ssh write hole. .ssh was read-denied but writable, leaving an
   authorized_keys append vector on any host where the runner uid can
   reach the target home. Now denied for write too.

3. Dedicated primary group instead of staff (gid 20). staff is the
   default group for every normal macOS account, so the runner process
   inherited group access to the many staff-group-owned files on a
   typical Mac. The service user now gets a dedicated _ephemerd group.
   Provisioning is best-effort: any failure falls back to staff (the
   previously-tested behavior), so a group hiccup never blocks jobs.

Not done here (documented in a code comment as a follow-up): flipping
the profile from allow-by-default to deny-by-default. That is the
stronger posture for native execution but requires enumerating every
path the GHA runner + toolchains touch and live-testing on macOS so
jobs don't break — can't be verified blind from a non-macOS host.

The LAN-egress gap (sandbox-exec has no CIDR support; pf rules still a
follow-up) is unchanged and remains the reason native mode should stay
restricted to trusted first-party repos.

The hardened sandbox blocked the GHA runner from starting. Three
distinct macOS sandbox-exec behaviors, each found via local repro:

1. deny file-read* on the native subtree blocked file-read-metadata,
   which realpath() needs to traverse through native/ to the job dir.
   The .NET host died with "Failed to resolve full path of the current
   executable" (exit 133). Fixed: deny only file-read-data.

2. getcwd() and bash walk UP from the job's runner dir and must
   readdir(native/) to learn the job-id component name; the read-data
   deny on the native subtree blocked that, giving "getcwd: cannot
   access parent directories" and "run.sh: Operation not permitted"
   (exit 126). Fixed: allow file-read-data on the native dir node
   (literal) — leaks only the non-secret list of concurrent job ids.

3. macOS sandbox resolves a specific-operation deny (file-read-data)
   over a later wildcard allow (file-read*), so the per-job re-allow
   must name file-read-data explicitly to win. Added an explicit
   file-read-data re-allow on the job subtree alongside file-read*.

Job-to-job isolation is preserved: a sibling job's directory listing
and file contents stay denied (verified). Smoke-test jobs now run
end-to-end as _ephemerd with all steps green.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>