Skip to content

feat(gateway): add GitHub App authentication#673

Merged
marcusrbrown merged 1 commit into
mainfrom
feat/gateway-github-app-auth
May 24, 2026
Merged

feat(gateway): add GitHub App authentication#673
marcusrbrown merged 1 commit into
mainfrom
feat/gateway-github-app-auth

Conversation

@marcusrbrown
Copy link
Copy Markdown
Collaborator

Adds GitHub App authentication to the gateway. Second PR in the Unit 5 series after the bindings store (#672).

Two-stage auth flow:

  1. JWT-level Octokit auth via createAppAuth({appId, privateKey}) — used once per authForRepo call to look up the installation for a given (owner, repo).
  2. Installation-token Octokit via createAppAuth({appId, privateKey, installationId}) — returned to the caller, used for the actual repo API calls.

The discovery step (GET /repos/{owner}/{repo}/installation) is new code in this repo — the existing action-tier client only supports the second stage. The gateway needs the first stage because the install URL is the operator-facing recovery path when the App isn't installed on the target repo.

Permission verification at discovery time:

  • Required minimum: contents: read (future units will add more).
  • Over-privileged installation (contents: write, etc.) → succeeds with a WARN log. Operators may have other tools sharing the App.
  • Under-privileged or missing contents → returns InsufficientPermissionsError with the App permissions URL.

Caching:

  • (owner, repo) → installationId cached in-memory for the gateway lifetime.
  • Cached entries are best-effort — invalidateCache(owner, repo) evicts on auth failure (consumers will call this on 401/404).
  • JWTs themselves are NOT cached. They're sub-millisecond to mint and short-lived; caching adds complexity for no measurable savings.

Credential handling (security-critical):

  • _FILE convention via readSecret('GITHUB_APP_ID') and readMultilineSecret('GITHUB_APP_PRIVATE_KEY'). Private keys never appear in env vars.
  • Restart required after rotating secrets/github-app-private-key — bind-mounted files aren't reloaded by the running process. Documented in the README.
  • safeErrorMessage() scrubs PEM blocks and JWT-shaped strings from any error message that surfaces.
  • Captured-logger test asserts no JWT, private key, or installation token appears in any log line emitted by the client.

Compose changes:

README changes:

  • New "GitHub App" section covering App creation, required permissions, credential file setup, key rotation, and the install URL.

Test coverage: 18 new tests covering all happy paths, under/over-privileged installations, missing credentials, network failures, error scrubbing, and cache invalidation. 197 gateway tests total (was 179).

Out of scope: nothing in this PR uses the App client yet. PR C (workspace-agent scaffold) is independent and ships next; PR D wires the slash command and consumes this client.

Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: CONDITIONAL

Can merge after addressing the one blocking issue below.


Blocking issues

Stage-2 auth failure permanently poisons the cache for that pair.

When is already cached and throws (e.g. the installation was revoked, the App key was rotated, or the GitHub API returns a transient error), the outer returns without evicting the cached entry. Every subsequent call for that pair skips stage 1 (cache hit), fails again at stage 2, and returns the same — forever, until process restart or a manual call that the caller can't make because they received instead of a live .

The consumer-facing contract on covers 401/404 from the returned Octokit, not from stage-2 token minting itself. These are different failure sites.

Fix: auto-invalidate in the stage-2 failure path before returning the error:

// Stage 2: Mint an installation token using the discovered installationId.
const installAuth = createAppAuth({appId, privateKey, installationId})
let token: string
try {
  ;({token} = await installAuth({type: 'installation'}))
} catch (mintError) {
  // Stage-2 failure means the cached installationId is no longer usable.
  // Evict it so the next call re-discovers rather than failing indefinitely.
  installationCache.delete(cacheKey(owner, repo))
  return err(new AuthError(safeErrorMessage(mintError)))
}

Non-blocking concerns

  1. ** is undocumented as sensitive.** The raw installation token lives on the returned result. Callers downstream (PR D) should not log it. A one-line JSDoc on the field () makes the invariant explicit at the call site.

  2. ** / are hard-required () at startup, not optional.** This is a real operational break for existing deployments — ing the files (as the README prescribes for optional secrets) won't work; operators must populate them. The README correctly says "must create" rather than "touch", so this is documented, but worth a verbal heads-up in deploy comms alongside the PR.

  3. Over-privileged WARN fires only on the first discovery. If an operator upgrades the installation's permissions to while the gateway is running, the WARN is never re-logged after the cache is warm. Intentional, but worth a comment in so the next reader doesn't think it's a bug.


Missing tests

  1. Stage-2 failure with a warm cache. No test covers the case where throws for after a successful first call has populated the cache. This is exactly the scenario that exposes the blocking bug above. Add a test that:

    • calls successfully (warms cache),
    • reconfigures to throw for ,
    • calls again and asserts is returned,
    • then reconfigures to succeed again and calls a third time — asserting that the cache was evicted (discovery runs again, is called a second time).
  2. Concurrent discovery calls (non-blocking). Two simultaneous calls for the same uncached will both enter stage 1 and issue two discovery requests. This is harmless (idempotent) but the test suite has no coverage for the concurrent path. Consider a test that fires both before the first resolves.


Risk assessment: MED

  • Blast radius: the blocking bug is latent until an installation token mint fails post-cache-warm. In the steady state (stable installation, valid key) it never triggers. But key rotation — the documented operational procedure — will trigger it on any call made while the old key is still cached.
  • Security exposure: LOW. The credential-scrubbing and convention are well-implemented. The mitmproxy allowlist already covers . No new attack surface.
  • Regression risk: LOW for existing gateway functionality (App client is not wired into any command yet). HIGH for the first PR that wires it.

Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26350117112
Cache hit
Session ses_1a8191f06ffezcPjUPz0AvMKiT

Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: CONDITIONAL

Can merge after addressing the one blocking issue below.


Blocking issues

Stage-2 auth failure permanently poisons the cache for that (owner, repo) pair.

When installationId is already cached and installAuth({type: 'installation'}) throws (e.g. the installation was revoked, the App key was rotated, or the GitHub API returns a transient error), the outer try/catch returns err(new AuthError(...)) without evicting the cached entry. Every subsequent authForRepo call for that pair skips stage 1 (cache hit), fails again at stage 2, and returns the same AuthError — forever, until process restart or a manual invalidateCache call that the caller can't make because they received AuthError instead of a live Octokit.

The consumer-facing contract on invalidateCache covers 401/404 from the returned Octokit, not from stage-2 token minting itself. These are different failure sites.

Fix: auto-invalidate in the stage-2 failure path before returning the error:

// Stage 2: Mint an installation token using the discovered installationId.
const installAuth = createAppAuth({appId, privateKey, installationId})
let token: string
try {
  ;({token} = await installAuth({type: 'installation'}))
} catch (mintError) {
  // Stage-2 failure means the cached installationId is no longer usable.
  // Evict it so the next call re-discovers rather than failing indefinitely.
  installationCache.delete(cacheKey(owner, repo))
  return err(new AuthError(safeErrorMessage(mintError)))
}

Non-blocking concerns

  1. AppClientAuthResult.token is undocumented as sensitive. The raw installation token lives on the returned result. Callers downstream (PR D) should not log it. A one-line JSDoc on the token field (/** Raw installation token — do not log. */) makes the invariant explicit at the call site.

  2. githubAppId / githubAppPrivateKey are hard-required (readSecret) at startup, not optional. This is a real operational break for existing deployments — touching the files (as the README prescribes for optional secrets) won't work; operators must populate them. The README correctly says "must create" rather than "touch", so this is documented, but worth a verbal heads-up in deploy comms alongside the PR.

  3. Over-privileged WARN fires only on the first discovery. If an operator upgrades the installation's permissions to write while the gateway is running, the WARN is never re-logged after the cache is warm. Intentional, but worth a comment in verifyPermissions so the next reader doesn't think it's a bug.


Missing tests

  1. Stage-2 failure with a warm cache. No test covers the case where mockAuth throws for type: 'installation' after a successful first call has populated the cache. This is exactly the scenario that exposes the blocking bug above. Add a test that: (a) calls authForRepo successfully to warm the cache, (b) reconfigures mockAuth to throw for type: 'installation', (c) calls authForRepo again and asserts AuthError is returned, (d) reconfigures mockAuth to succeed again and calls a third time — asserting the cache was evicted by verifying that mockRequest was called twice total.

  2. Concurrent discovery calls (non-blocking). Two simultaneous authForRepo calls for the same uncached (owner, repo) will both enter stage 1 and issue two discovery requests. This is harmless (idempotent) but has no coverage.


Risk assessment: MED

  • Blast radius: the blocking bug is latent until an installation token mint fails post-cache-warm. In the steady state it never triggers. But key rotation — the documented operational procedure — will trigger it on any authForRepo call made before a restart.
  • Security exposure: LOW. The credential-scrubbing and _FILE convention are well-implemented. The mitmproxy allowlist already covers api.github.com. No new attack surface.
  • Regression risk: LOW for existing gateway functionality (App client is not wired into any command yet). HIGH for the first PR that wires it.

Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26350117112
Cache hit
Session ses_1a8191f06ffezcPjUPz0AvMKiT

fro-bot
fro-bot previously approved these changes May 24, 2026
Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: PASS

All blocking issues and missing tests from the previous review have been addressed in the updated commit (4ce0fa1).

Blocking issues

None. The stage-2 cache-poisoning bug is fixed: installationCache.delete(cacheKey(owner, repo)) is called before returning err(new AuthError(...)) in the mint failure path.

Non-blocking concerns

None. The three non-blocking items from the previous review were also addressed:

  • AppClientAuthResult.token now has a JSDoc warning callers never to log or persist it.
  • verifyPermissions has a comment explaining the over-privileged WARN fires only on cache-miss (intentional).
  • createAppAuth was promoted from a dynamic import to a static top-level import (cleaner).

Missing tests

None. Both missing tests are now present:

  • cache eviction: stage-2 mint failure with warm cache evicts entry so next call re-discovers — covers the full warm-cache → mint-failure → eviction → re-discovery cycle.
  • concurrent discovery: two simultaneous authForRepo calls for same uncached pair both succeed — covers the parallel path.

Risk assessment (LOW): likelihood of regression, security exposure, or blast radius

LOW across the board. The App client is not wired into any command yet (deferred to PR D), so there is no regression surface in the current gateway. The credential-handling and cache-eviction logic are now well-covered by the test suite. No new attack surface.


Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26350320009
Cache hit
Session ses_1a8191f06ffezcPjUPz0AvMKiT

Two-stage auth flow (JWT discovery → installation token) with permission verification, install-URL surfacing on AppNotInstalledError, in-memory installationId cache with invalidation, and error scrubbing for PEM/JWT/token leaks. _FILE convention for credentials with restart-required key rotation documented in README. 18 new tests.
Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: PASS

This is a rebase of the previously approved commit (4ce0fa1) onto an updated base. The implementation is identical in all substantive respects.

Blocking issues

None.

Non-blocking concerns

None. The two trivial changes introduced by the rebase are both improvements:

  • app-client.ts: logger interface properties (warn, debug) gained readonly modifiers — stricter, correct.
  • app-client.test.ts: early-return guards changed from if (!result.success) / if (result.success) to explicit === false / === true comparisons — consistent with the rest of the test suite.

All items from the prior review cycle (blocking bug, token JSDoc, over-privileged WARN comment, stage-2 cache-eviction test, concurrent discovery test) remain addressed.

Missing tests

None.

Risk assessment (LOW): likelihood of regression, security exposure, or blast radius

Identical to the previous assessment. App client is not wired into any command yet; no regression surface in the current gateway; no new attack surface.


Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26350640692
Cache hit
Session ses_1a8191f06ffezcPjUPz0AvMKiT

@marcusrbrown marcusrbrown merged commit a2bdad9 into main May 24, 2026
10 checks passed
@marcusrbrown marcusrbrown deleted the feat/gateway-github-app-auth branch May 24, 2026 04:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants