feat(gateway): add GitHub App authentication by marcusrbrown · Pull Request #673 · fro-bot/agent

marcusrbrown · 2026-05-24T02:51:53Z

Adds GitHub App authentication to the gateway. Second PR in the Unit 5 series after the bindings store (#672).

Two-stage auth flow:

- Stage 1: JWT-level Octokit auth via createAppAuth({appId, privateKey}), used to look up the installation for a given (owner, repo). It runs at most once per authForRepo call and is skipped when the installationId is already cached.
- Stage 2: installation-token Octokit via createAppAuth({appId, privateKey, installationId}), returned to the caller and used for the actual repo API calls.
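A minimal sketch of the two-stage flow above, assuming both stages are injected as plain async functions so the control flow can be read and exercised without network access. In the real client, stage 1 would be a JWT-authenticated request such as `octokit.request('GET /repos/{owner}/{repo}/installation', {owner, repo})` and stage 2 would mint a token through `createAppAuth`. The `AppStages` interface, `authForRepoSketch`, and the fake values are hypothetical, and the sketch returns the token directly rather than an installation-authenticated Octokit.

```typescript
// Hypothetical seam: the two stages as injectable async functions.
interface AppStages {
  // Stage 1 (App JWT): resolve the installation that covers owner/repo.
  discoverInstallationId(owner: string, repo: string): Promise<number>
  // Stage 2 (installation auth): mint a short-lived installation token.
  mintInstallationToken(installationId: number): Promise<string>
}

interface RepoAuth {
  installationId: number
  /** Raw installation token: never log or persist it. */
  token: string
}

async function authForRepoSketch(stages: AppStages, owner: string, repo: string): Promise<RepoAuth> {
  // Stage 1 uses App-level (JWT) credentials only.
  const installationId = await stages.discoverInstallationId(owner, repo)
  // Stage 2 trades the installation id for a token scoped to that installation.
  const token = await stages.mintInstallationToken(installationId)
  return {installationId, token}
}

// In-memory fakes that record call order (no network).
const calls: string[] = []
const fakeStages: AppStages = {
  async discoverInstallationId(owner, repo) {
    calls.push(`discover ${owner}/${repo}`)
    return 42
  },
  async mintInstallationToken(installationId) {
    calls.push(`mint ${installationId}`)
    return 'ghs_example'
  },
}
```

Calling `authForRepoSketch(fakeStages, 'fro-bot', 'agent')` runs discovery first and minting second, which is the ordering the cache and eviction logic below depend on.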

The discovery step (GET /repos/{owner}/{repo}/installation) is new code in this repo; the existing action-tier client supports only the second stage. The gateway needs the first stage because, when the App isn't installed on the target repo, the install URL is the operator-facing recovery path.
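One way the discovery result could map onto that recovery path, sketched under assumptions: GitHub answers `GET /repos/{owner}/{repo}/installation` with 404 when the App is not installed on the repository, and the install URL is assumed to follow the `https://github.com/apps/<slug>/installations/new` pattern, with the App slug coming from hypothetical configuration. Octokit's `request` throws a `RequestError` on error statuses, so in practice the 404 branch would live in a `catch`; the response is modeled as a plain value here to keep the sketch dependency-free. Only the `AppNotInstalledError` name comes from the PR; its fields and `interpretDiscovery` are illustrative.

```typescript
// Error surfaced when discovery finds no installation covering the repo.
class AppNotInstalledError extends Error {
  readonly owner: string
  readonly repo: string
  readonly installUrl: string

  constructor(owner: string, repo: string, installUrl: string) {
    super(`GitHub App is not installed on ${owner}/${repo}; install it at ${installUrl}`)
    this.name = 'AppNotInstalledError'
    this.owner = owner
    this.repo = repo
    this.installUrl = installUrl
  }
}

// Assumed shape of the discovery response; only the fields used here.
interface DiscoveryResponse {
  status: number
  data?: {id: number}
}

// Assumption: operator-facing install URL built from the App's slug.
function installUrlFor(appSlug: string): string {
  return `https://github.com/apps/${encodeURIComponent(appSlug)}/installations/new`
}

// 200 yields the installation id; 404 yields the recoverable "not installed" error.
function interpretDiscovery(
  res: DiscoveryResponse,
  owner: string,
  repo: string,
  appSlug: string,
): number | AppNotInstalledError {
  if (res.status === 200 && res.data !== undefined) return res.data.id
  if (res.status === 404) return new AppNotInstalledError(owner, repo, installUrlFor(appSlug))
  // Anything else (rate limits, 5xx) is not an installation problem; let the caller handle it.
  throw new Error(`unexpected discovery status ${res.status}`)
}
```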

Permission verification at discovery time:

- Required minimum: contents: read (future units will add more).
- Over-privileged installation (contents: write, etc.) → succeeds with a WARN log. Operators may have other tools sharing the App.
- Under-privileged or missing contents → returns InsufficientPermissionsError with the App permissions URL.
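The permission rules above can be sketched as a pure function, assuming permission levels are ordered none < read < write < admin and that the installation's `permissions` object maps permission names to those levels. `verifyPermissionsSketch`, `REQUIRED`, and the returned shape are hypothetical; the real check also attaches the App permissions URL to `InsufficientPermissionsError`, which is omitted here. `metadata` is skipped on the assumption that every installation carries read-only metadata access (GitHub makes it mandatory), so counting it would trigger a WARN on every discovery.

```typescript
type Level = 'none' | 'read' | 'write' | 'admin'
type Permissions = Record<string, Level | undefined>

const RANK: Record<Level, number> = {none: 0, read: 1, write: 2, admin: 3}

// Required minimum from the PR description; later units would extend this map.
const REQUIRED: Permissions = {contents: 'read'}

type PermissionCheck =
  | {kind: 'ok'; excess: string[]} // excess non-empty => log a WARN, still succeed
  | {kind: 'insufficient'; missing: string[]}

function verifyPermissionsSketch(granted: Permissions, required: Permissions = REQUIRED): PermissionCheck {
  const rank = (level: Level | undefined): number => RANK[level ?? 'none']
  // Any required permission that is absent or below its minimum is fatal.
  const missing = Object.keys(required).filter((key) => rank(granted[key]) < rank(required[key]))
  if (missing.length > 0) return {kind: 'insufficient', missing}
  // Anything granted above what is required is allowed but reported for the WARN log.
  const excess = Object.keys(granted).filter(
    (key) => key !== 'metadata' && rank(granted[key]) > rank(required[key]),
  )
  return {kind: 'ok', excess}
}
```

Keeping the check pure makes the over-privileged and under-privileged branches easy to unit-test without a mocked discovery response.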

Caching:

- (owner, repo) → installationId cached in-memory for the gateway lifetime.
- Cached entries are best-effort: invalidateCache(owner, repo) evicts an entry on auth failure (consumers will call this on 401/404).
- JWTs themselves are NOT cached. They're sub-millisecond to mint and short-lived; caching adds complexity for no measurable savings.
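A minimal sketch of the in-memory installationId cache, assuming a `Map` keyed by `owner/repo`. Lower-casing the key is an added assumption (GitHub owner and repository names are case-insensitive), and the class name is hypothetical. JWTs are deliberately absent: per the description, they are minted fresh on each discovery.

```typescript
// (owner, repo) -> installationId, held for the lifetime of the process.
class InstallationCacheSketch {
  private readonly entries = new Map<string, number>()

  // Assumption: normalize case so "Fro-Bot/Agent" and "fro-bot/agent" share an entry.
  private static key(owner: string, repo: string): string {
    return `${owner.toLowerCase()}/${repo.toLowerCase()}`
  }

  get(owner: string, repo: string): number | undefined {
    return this.entries.get(InstallationCacheSketch.key(owner, repo))
  }

  set(owner: string, repo: string, installationId: number): void {
    this.entries.set(InstallationCacheSketch.key(owner, repo), installationId)
  }

  // Best-effort eviction; returns whether an entry was actually removed.
  invalidate(owner: string, repo: string): boolean {
    return this.entries.delete(InstallationCacheSketch.key(owner, repo))
  }
}
```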

Credential handling (security-critical):

- _FILE convention via readSecret('GITHUB_APP_ID') and readMultilineSecret('GITHUB_APP_PRIVATE_KEY'). Private keys never appear in env vars.
- Restart required after rotating secrets/github-app-private-key: bind-mounted files aren't reloaded by the running process. Documented in the README.
- safeErrorMessage() scrubs PEM blocks and JWT-shaped strings from any error message that surfaces.
- Captured-logger test asserts no JWT, private key, or installation token appears in any log line emitted by the client.
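A sketch of the scrubbing and captured-logger ideas, with assumed patterns: PEM blocks are matched by their BEGIN/END armor lines, JWT-shaped strings by three base64url segments whose header starts with `eyJ` (the base64url encoding of `{"`), and installation tokens by the `ghs_` prefix GitHub uses for them. `safeErrorMessageSketch` and `capturingLogger` are hypothetical names, not the gateway's actual helpers; the `Logger` shape mirrors the readonly `warn`/`debug` interface mentioned in the final review.

```typescript
// PEM armor, e.g. -----BEGIN RSA PRIVATE KEY----- ... -----END RSA PRIVATE KEY-----.
const PEM_BLOCK = /-----BEGIN [A-Z0-9 ]+-----[\s\S]*?-----END [A-Z0-9 ]+-----/g
// JWT-shaped: three base64url segments; a JSON header starting with {" encodes to a leading "eyJ".
const JWT_LIKE = /\beyJ[A-Za-z0-9_-]*\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/g
// Assumption: installation access tokens carry GitHub's ghs_ prefix.
const INSTALLATION_TOKEN = /\bghs_[A-Za-z0-9]+/g

function safeErrorMessageSketch(error: unknown): string {
  const raw = error instanceof Error ? error.message : String(error)
  return raw
    .replace(PEM_BLOCK, '[REDACTED PEM]')
    .replace(JWT_LIKE, '[REDACTED JWT]')
    .replace(INSTALLATION_TOKEN, '[REDACTED TOKEN]')
}

interface Logger {
  readonly warn: (message: string) => void
  readonly debug: (message: string) => void
}

// Test helper: records every emitted line so a test can scan all log output for secrets.
function capturingLogger(lines: string[]): Logger {
  return {
    warn: (message) => {
      lines.push(`WARN ${message}`)
    },
    debug: (message) => {
      lines.push(`DEBUG ${message}`)
    },
  }
}
```

A leak test would route every client log call through `capturingLogger` and then assert that no captured line matches any of the three patterns.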

Compose changes:

- Two new bind-mounts: secrets/github-app-id and secrets/github-app-private-key, with the same create_host_path: false posture as the existing optional secrets (#649, "fix(gateway): harden deploy contract for infra-as-code consumers"; #652, "fix(gateway): plumb DISCORD_PRIVILEGED_INTENTS through compose"). Missing source files produce a clear docker compose up error.
- Two new _FILE env vars in the gateway service block.
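A sketch of how the `_FILE` convention could resolve the two new variables, assuming they are named `GITHUB_APP_ID_FILE` and `GITHUB_APP_PRIVATE_KEY_FILE` (the secret name plus `_FILE`) and that each points at a bind-mounted file. The function names and the injectable `read` parameter are hypothetical; the latter keeps the sketch testable without touching disk. If these run once at startup, a rotated key file is not picked up until the process restarts, which matches the README note.

```typescript
import {readFileSync} from 'node:fs'

type Env = Record<string, string | undefined>
type ReadFile = (path: string) => string

const readUtf8: ReadFile = (path) => readFileSync(path, 'utf8')

// Resolve NAME via NAME_FILE; there is deliberately no fallback to a plain NAME env var.
function readFileSecret(name: string, env: Env, read: ReadFile): string {
  const path = env[`${name}_FILE`]
  if (path === undefined || path === '') {
    // Hard requirement: fail fast. The message names the variable, never the value.
    throw new Error(`${name}_FILE is not set`)
  }
  return read(path)
}

// Single-line secrets such as the App ID: drop surrounding whitespace, including the trailing newline.
function readSecretSketch(name: string, env: Env, read: ReadFile = readUtf8): string {
  const value = readFileSecret(name, env, read).trim()
  if (value === '') throw new Error(`${name} file is empty`)
  return value
}

// Multi-line secrets such as the PEM key: keep interior newlines, normalize CRLF, trim only the ends.
function readMultilineSecretSketch(name: string, env: Env, read: ReadFile = readUtf8): string {
  const value = readFileSecret(name, env, read).replace(/\r\n/g, '\n').trim()
  if (value === '') throw new Error(`${name} file is empty`)
  return value
}
```

At startup this might be called as `readSecretSketch('GITHUB_APP_ID', process.env)`.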

README changes:

New "GitHub App" section covering App creation, required permissions, credential file setup, key rotation, and the install URL.

Test coverage: 18 new tests covering all happy paths, under/over-privileged installations, missing credentials, network failures, error scrubbing, and cache invalidation. 197 gateway tests total (was 179).

Out of scope: nothing in this PR uses the App client yet. PR C (workspace-agent scaffold) is independent and ships next; PR D wires the slash command and consumes this client.

fro-bot

Verdict: CONDITIONAL

Can merge after addressing the one blocking issue below.

Blocking issues

Stage-2 auth failure permanently poisons the cache for that (owner, repo) pair.

When installationId is already cached and installAuth({type: 'installation'}) throws (e.g. the installation was revoked, the App key was rotated, or the GitHub API returns a transient error), the outer try/catch returns err(new AuthError(...)) without evicting the cached entry. Every subsequent authForRepo call for that pair skips stage 1 (cache hit), fails again at stage 2, and returns the same AuthError. This continues until the process restarts or someone calls invalidateCache manually, which the caller has no reason to do because it received an AuthError instead of a live Octokit.

The consumer-facing contract on invalidateCache covers 401/404 from the returned Octokit, not from stage-2 token minting itself. These are different failure sites.

Fix: auto-invalidate in the stage-2 failure path before returning the error:

```ts
// Stage 2: Mint an installation token using the discovered installationId.
const installAuth = createAppAuth({appId, privateKey, installationId})
let token: string
try {
  ;({token} = await installAuth({type: 'installation'}))
} catch (mintError) {
  // Stage-2 failure means the cached installationId is no longer usable.
  // Evict it so the next call re-discovers rather than failing indefinitely.
  installationCache.delete(cacheKey(owner, repo))
  return err(new AuthError(safeErrorMessage(mintError)))
}
```

Non-blocking concerns

- AppClientAuthResult.token is undocumented as sensitive. The raw installation token lives on the returned result, and downstream callers (PR D) should not log it. A one-line JSDoc on the token field (/** Raw installation token: do not log. */) makes the invariant explicit at the call site.
- githubAppId / githubAppPrivateKey are hard-required (readSecret) at startup, not optional. This is a real operational break for existing deployments: touching the files (as the README prescribes for optional secrets) won't work; operators must populate them. The README correctly says "must create" rather than "touch", so this is documented, but it's worth a verbal heads-up in deploy comms alongside the PR.
- Over-privileged WARN fires only on the first discovery. If an operator upgrades the installation's permissions to write while the gateway is running, the WARN is never re-logged after the cache is warm. This is intentional, but worth a comment in verifyPermissions so the next reader doesn't think it's a bug.

Missing tests

Stage-2 failure with a warm cache. No test covers the case where mockAuth throws for type: 'installation' after a successful first call has populated the cache. This is exactly the scenario that exposes the blocking bug above. Add a test that:
- calls authForRepo successfully (warms cache),
- reconfigures mockAuth to throw for type: 'installation',
- calls authForRepo again and asserts AuthError is returned,
- then reconfigures mockAuth to succeed again and calls a third time, asserting that the cache was evicted (discovery runs again, so mockRequest is called a second time).
Concurrent discovery calls (non-blocking). Two simultaneous authForRepo calls for the same uncached (owner, repo) will both enter stage 1 and issue two discovery requests. This is harmless (idempotent), but the test suite has no coverage for the concurrent path. Consider a test that fires both calls before the first resolves.
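The warm-cache test requested above can be sketched end to end with scripted fakes, assuming a `Map`-backed cache and stage functions that a test can make fail on demand. `makeAuthForRepo`, `Stages`, and the result shape are hypothetical stand-ins for the real client and its mocks; the eviction in the stage-2 `catch` is the fix proposed in this review, and the real code would scrub the message through `safeErrorMessage` first.

```typescript
class AuthError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'AuthError'
  }
}

type AuthResult = {success: true; token: string} | {success: false; error: AuthError}

// Hypothetical seams for the two stages, so the test can script failures.
interface Stages {
  discover(owner: string, repo: string): Promise<number>
  mint(installationId: number): Promise<string>
}

const messageOf = (e: unknown): string => (e instanceof Error ? e.message : String(e))

function makeAuthForRepo(stages: Stages) {
  const cache = new Map<string, number>()
  return async (owner: string, repo: string): Promise<AuthResult> => {
    const key = `${owner}/${repo}`
    let installationId = cache.get(key)
    if (installationId === undefined) {
      try {
        installationId = await stages.discover(owner, repo) // stage 1, only on a cache miss
      } catch (e) {
        return {success: false, error: new AuthError(messageOf(e))}
      }
      cache.set(key, installationId)
    }
    try {
      return {success: true, token: await stages.mint(installationId)} // stage 2
    } catch (e) {
      cache.delete(key) // the fix: evict so the next call re-discovers
      return {success: false, error: new AuthError(messageOf(e))}
    }
  }
}

// The four steps from the review, driven by a flag on the stage-2 fake.
async function warmCacheEvictionScenario() {
  let discoveries = 0
  let mintShouldFail = false
  const authForRepo = makeAuthForRepo({
    async discover() {
      discoveries += 1
      return 42
    },
    async mint() {
      if (mintShouldFail) throw new Error('installation token mint failed')
      return 'ghs_example'
    },
  })
  const first = await authForRepo('fro-bot', 'agent') // (a) success warms the cache
  mintShouldFail = true // (b) stage 2 starts failing
  const second = await authForRepo('fro-bot', 'agent') // (c) expect AuthError
  mintShouldFail = false // (d) stage 2 recovers
  const third = await authForRepo('fro-bot', 'agent')
  return {first, second, third, discoveries}
}
```

With the eviction in place, discovery runs twice across the three calls; without it, the second call would leave the entry cached and discovery would run only once.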

Risk assessment: MED

- Blast radius: the blocking bug is latent until an installation token mint fails post-cache-warm. In the steady state (stable installation, valid key) it never triggers. But key rotation, the documented operational procedure, will trigger it on any authForRepo call made while the old key is still in use.
- Security exposure: LOW. The credential-scrubbing and _FILE convention are well-implemented. The mitmproxy allowlist already covers api.github.com. No new attack surface.
- Regression risk: LOW for existing gateway functionality (App client is not wired into any command yet). HIGH for the first PR that wires it.

Run Summary

| Field | Value |
| --- | --- |
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26350117112 |
| Cache | hit |
| Session | ses_1a8191f06ffezcPjUPz0AvMKiT |

fro-bot

Verdict: CONDITIONAL

Can merge after addressing the one blocking issue below.

Blocking issues

Stage-2 auth failure permanently poisons the cache for that (owner, repo) pair.

When installationId is already cached and installAuth({type: 'installation'}) throws (e.g. the installation was revoked, the App key was rotated, or the GitHub API returns a transient error), the outer try/catch returns err(new AuthError(...)) without evicting the cached entry. Every subsequent authForRepo call for that pair skips stage 1 (cache hit), fails again at stage 2, and returns the same AuthError. This continues until the process restarts or someone calls invalidateCache manually, which the caller has no reason to do because it received an AuthError instead of a live Octokit.

The consumer-facing contract on invalidateCache covers 401/404 from the returned Octokit, not from stage-2 token minting itself. These are different failure sites.

Fix: auto-invalidate in the stage-2 failure path before returning the error:

```ts
// Stage 2: Mint an installation token using the discovered installationId.
const installAuth = createAppAuth({appId, privateKey, installationId})
let token: string
try {
  ;({token} = await installAuth({type: 'installation'}))
} catch (mintError) {
  // Stage-2 failure means the cached installationId is no longer usable.
  // Evict it so the next call re-discovers rather than failing indefinitely.
  installationCache.delete(cacheKey(owner, repo))
  return err(new AuthError(safeErrorMessage(mintError)))
}
```

Non-blocking concerns

- AppClientAuthResult.token is undocumented as sensitive. The raw installation token lives on the returned result, and downstream callers (PR D) should not log it. A one-line JSDoc on the token field (/** Raw installation token: do not log. */) makes the invariant explicit at the call site.
- githubAppId / githubAppPrivateKey are hard-required (readSecret) at startup, not optional. This is a real operational break for existing deployments: touching the files (as the README prescribes for optional secrets) won't work; operators must populate them. The README correctly says "must create" rather than "touch", so this is documented, but it's worth a verbal heads-up in deploy comms alongside the PR.
- Over-privileged WARN fires only on the first discovery. If an operator upgrades the installation's permissions to write while the gateway is running, the WARN is never re-logged after the cache is warm. This is intentional, but worth a comment in verifyPermissions so the next reader doesn't think it's a bug.

Missing tests

- Stage-2 failure with a warm cache. No test covers the case where mockAuth throws for type: 'installation' after a successful first call has populated the cache. This is exactly the scenario that exposes the blocking bug above. Add a test that: (a) calls authForRepo successfully to warm the cache, (b) reconfigures mockAuth to throw for type: 'installation', (c) calls authForRepo again and asserts AuthError is returned, (d) reconfigures mockAuth to succeed again and calls a third time, asserting the cache was evicted by verifying that mockRequest was called twice total.
- Concurrent discovery calls (non-blocking). Two simultaneous authForRepo calls for the same uncached (owner, repo) will both enter stage 1 and issue two discovery requests. This is harmless (idempotent) but has no coverage.
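The concurrent path can be pinned down with a gate that holds discovery open until both calls are in flight, assuming a simplified client that only covers the cache-then-discover step. `makeDiscoveringClient` and `concurrentDiscoveryScenario` are hypothetical names. The scenario records that both calls succeed and that discovery runs twice, which is the harmless duplicate the review describes.

```typescript
// Simplified cache-then-discover client (stage 2 omitted).
function makeDiscoveringClient(discover: (owner: string, repo: string) => Promise<number>) {
  const cache = new Map<string, number>()
  return async (owner: string, repo: string): Promise<number> => {
    const key = `${owner}/${repo}`
    const cached = cache.get(key)
    if (cached !== undefined) return cached
    const installationId = await discover(owner, repo)
    cache.set(key, installationId) // last writer wins; both writes carry the same id
    return installationId
  }
}

async function concurrentDiscoveryScenario(): Promise<{ids: number[]; discoveries: number}> {
  let discoveries = 0
  let release: () => void = () => {}
  // Every discovery waits on this gate, so both calls miss the cache before either resolves.
  const gate = new Promise<void>((resolve) => {
    release = () => resolve()
  })
  const authForRepo = makeDiscoveringClient(async () => {
    discoveries += 1
    await gate
    return 42
  })
  const first = authForRepo('fro-bot', 'agent')
  const second = authForRepo('fro-bot', 'agent')
  release()
  const ids = await Promise.all([first, second])
  return {ids, discoveries}
}
```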

Risk assessment: MED

- Blast radius: the blocking bug is latent until an installation token mint fails post-cache-warm. In the steady state it never triggers. But key rotation, the documented operational procedure, will trigger it on any authForRepo call made before a restart.
- Security exposure: LOW. The credential-scrubbing and _FILE convention are well-implemented. The mitmproxy allowlist already covers api.github.com. No new attack surface.
- Regression risk: LOW for existing gateway functionality (App client is not wired into any command yet). HIGH for the first PR that wires it.

Run Summary

| Field | Value |
| --- | --- |
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26350117112 |
| Cache | hit |
| Session | ses_1a8191f06ffezcPjUPz0AvMKiT |

fro-bot

Verdict: PASS

All blocking issues and missing tests from the previous review have been addressed in the updated commit (4ce0fa1).

Blocking issues

None. The stage-2 cache-poisoning bug is fixed: installationCache.delete(cacheKey(owner, repo)) is called before returning err(new AuthError(...)) in the mint failure path.

Non-blocking concerns

None. The three non-blocking items from the previous review were also addressed:

- AppClientAuthResult.token now has a JSDoc warning callers never to log or persist it.
- verifyPermissions has a comment explaining the over-privileged WARN fires only on cache-miss (intentional).
- createAppAuth was promoted from a dynamic import to a static top-level import (cleaner).

Missing tests

None. Both missing tests are now present:

- "cache eviction: stage-2 mint failure with warm cache evicts entry so next call re-discovers" covers the full warm-cache → mint-failure → eviction → re-discovery cycle.
- "concurrent discovery: two simultaneous authForRepo calls for same uncached pair both succeed" covers the parallel path.

Risk assessment (LOW): likelihood of regression, security exposure, or blast radius

LOW across the board. The App client is not wired into any command yet (deferred to PR D), so there is no regression surface in the current gateway. The credential-handling and cache-eviction logic are now well-covered by the test suite. No new attack surface.

Run Summary

| Field | Value |
| --- | --- |
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26350320009 |
| Cache | hit |
| Session | ses_1a8191f06ffezcPjUPz0AvMKiT |

Two-stage auth flow (JWT discovery → installation token) with permission verification, install-URL surfacing on AppNotInstalledError, in-memory installationId cache with invalidation, and error scrubbing for PEM/JWT/token leaks. _FILE convention for credentials with restart-required key rotation documented in README. 18 new tests.

fro-bot

Verdict: PASS

This is a rebase of the previously approved commit (4ce0fa1) onto an updated base. The implementation is identical in all substantive respects.

Blocking issues

None.

Non-blocking concerns

None. The two trivial changes introduced by the rebase are both improvements:

- app-client.ts: logger interface properties (warn, debug) gained readonly modifiers, which is stricter and correct.
- app-client.test.ts: early-return guards changed from if (!result.success) / if (result.success) to explicit === false / === true comparisons, consistent with the rest of the test suite.

All items from the prior review cycle (blocking bug, token JSDoc, over-privileged WARN comment, stage-2 cache-eviction test, concurrent discovery test) remain addressed.

Missing tests

None.

Risk assessment (LOW): likelihood of regression, security exposure, or blast radius

Identical to the previous assessment. App client is not wired into any command yet; no regression surface in the current gateway; no new attack surface.

Run Summary

| Field | Value |
| --- | --- |
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26350640692 |
| Cache | hit |
| Session | ses_1a8191f06ffezcPjUPz0AvMKiT |

marcusrbrown requested a review from fro-bot as a code owner May 24, 2026 02:51

fro-bot added the agent: working label May 24, 2026

fro-bot requested changes May 24, 2026


fro-bot removed the agent: working label May 24, 2026

marcusrbrown force-pushed the feat/gateway-github-app-auth branch from 51e0ade to 4ce0fa1 May 24, 2026 03:02

fro-bot added the agent: working label May 24, 2026

fro-bot previously approved these changes May 24, 2026


fro-bot removed the agent: working label May 24, 2026

marcusrbrown dismissed fro-bot’s stale review via 3362497 May 24, 2026 03:19

marcusrbrown force-pushed the feat/gateway-github-app-auth branch from 4ce0fa1 to 3362497 May 24, 2026 03:19

fro-bot added the agent: working label May 24, 2026

fro-bot approved these changes May 24, 2026


fro-bot removed the agent: working label May 24, 2026

fro-bot mentioned this pull request May 24, 2026

Daily Org Oversight Report — 2026-05-24 (UTC) fro-bot/.github#3370

marcusrbrown merged commit a2bdad9 into main May 24, 2026
10 checks passed

marcusrbrown deleted the feat/gateway-github-app-auth branch May 24, 2026 04:18

fro-bot mentioned this pull request May 24, 2026

chore(release): pending release v0.45.0 #664

marcusrbrown merged 1 commit into main from feat/gateway-github-app-auth

2 participants
