feat(gateway): add GitHub App authentication#673
Conversation
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: CONDITIONAL
Can merge after addressing the one blocking issue below.
Blocking issues
Stage-2 auth failure permanently poisons the cache for that pair.
When is already cached and throws (e.g. the installation was revoked, the App key was rotated, or the GitHub API returns a transient error), the outer returns without evicting the cached entry. Every subsequent call for that pair skips stage 1 (cache hit), fails again at stage 2, and returns the same — forever, until process restart or a manual call that the caller can't make because they received instead of a live .
The consumer-facing contract on covers 401/404 from the returned Octokit, not from stage-2 token minting itself. These are different failure sites.
Fix: auto-invalidate in the stage-2 failure path before returning the error:
// Stage 2: Mint an installation token using the discovered installationId.
const installAuth = createAppAuth({appId, privateKey, installationId})
let token: string
try {
;({token} = await installAuth({type: 'installation'}))
} catch (mintError) {
// Stage-2 failure means the cached installationId is no longer usable.
// Evict it so the next call re-discovers rather than failing indefinitely.
installationCache.delete(cacheKey(owner, repo))
return err(new AuthError(safeErrorMessage(mintError)))
}Non-blocking concerns
-
** is undocumented as sensitive.** The raw installation token lives on the returned result. Callers downstream (PR D) should not log it. A one-line JSDoc on the field () makes the invariant explicit at the call site.
-
** / are hard-required () at startup, not optional.** This is a real operational break for existing deployments — ing the files (as the README prescribes for optional secrets) won't work; operators must populate them. The README correctly says "must create" rather than "touch", so this is documented, but worth a verbal heads-up in deploy comms alongside the PR.
-
Over-privileged WARN fires only on the first discovery. If an operator upgrades the installation's permissions to while the gateway is running, the WARN is never re-logged after the cache is warm. Intentional, but worth a comment in so the next reader doesn't think it's a bug.
Missing tests
-
Stage-2 failure with a warm cache. No test covers the case where throws for after a successful first call has populated the cache. This is exactly the scenario that exposes the blocking bug above. Add a test that:
- calls successfully (warms cache),
- reconfigures to throw for ,
- calls again and asserts is returned,
- then reconfigures to succeed again and calls a third time — asserting that the cache was evicted (discovery runs again, is called a second time).
-
Concurrent discovery calls (non-blocking). Two simultaneous calls for the same uncached will both enter stage 1 and issue two discovery requests. This is harmless (idempotent) but the test suite has no coverage for the concurrent path. Consider a test that fires both before the first resolves.
Risk assessment: MED
- Blast radius: the blocking bug is latent until an installation token mint fails post-cache-warm. In the steady state (stable installation, valid key) it never triggers. But key rotation — the documented operational procedure — will trigger it on any call made while the old key is still cached.
- Security exposure: LOW. The credential-scrubbing and convention are well-implemented. The mitmproxy allowlist already covers . No new attack surface.
- Regression risk: LOW for existing gateway functionality (App client is not wired into any command yet). HIGH for the first PR that wires it.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26350117112 |
| Cache | hit |
| Session | ses_1a8191f06ffezcPjUPz0AvMKiT |
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: CONDITIONAL
Can merge after addressing the one blocking issue below.
Blocking issues
Stage-2 auth failure permanently poisons the cache for that (owner, repo) pair.
When installationId is already cached and installAuth({type: 'installation'}) throws (e.g. the installation was revoked, the App key was rotated, or the GitHub API returns a transient error), the outer try/catch returns err(new AuthError(...)) without evicting the cached entry. Every subsequent authForRepo call for that pair skips stage 1 (cache hit), fails again at stage 2, and returns the same AuthError — forever, until process restart or a manual invalidateCache call that the caller can't make because they received AuthError instead of a live Octokit.
The consumer-facing contract on invalidateCache covers 401/404 from the returned Octokit, not from stage-2 token minting itself. These are different failure sites.
Fix: auto-invalidate in the stage-2 failure path before returning the error:
// Stage 2: Mint an installation token using the discovered installationId.
const installAuth = createAppAuth({appId, privateKey, installationId})
let token: string
try {
;({token} = await installAuth({type: 'installation'}))
} catch (mintError) {
// Stage-2 failure means the cached installationId is no longer usable.
// Evict it so the next call re-discovers rather than failing indefinitely.
installationCache.delete(cacheKey(owner, repo))
return err(new AuthError(safeErrorMessage(mintError)))
}Non-blocking concerns
-
AppClientAuthResult.tokenis undocumented as sensitive. The raw installation token lives on the returned result. Callers downstream (PR D) should not log it. A one-line JSDoc on thetokenfield (/** Raw installation token — do not log. */) makes the invariant explicit at the call site. -
githubAppId/githubAppPrivateKeyare hard-required (readSecret) at startup, not optional. This is a real operational break for existing deployments —touching the files (as the README prescribes for optional secrets) won't work; operators must populate them. The README correctly says "must create" rather than "touch", so this is documented, but worth a verbal heads-up in deploy comms alongside the PR. -
Over-privileged WARN fires only on the first discovery. If an operator upgrades the installation's permissions to
writewhile the gateway is running, the WARN is never re-logged after the cache is warm. Intentional, but worth a comment inverifyPermissionsso the next reader doesn't think it's a bug.
Missing tests
-
Stage-2 failure with a warm cache. No test covers the case where
mockAuththrows fortype: 'installation'after a successful first call has populated the cache. This is exactly the scenario that exposes the blocking bug above. Add a test that: (a) callsauthForReposuccessfully to warm the cache, (b) reconfiguresmockAuthto throw fortype: 'installation', (c) callsauthForRepoagain and assertsAuthErroris returned, (d) reconfiguresmockAuthto succeed again and calls a third time — asserting the cache was evicted by verifying thatmockRequestwas called twice total. -
Concurrent discovery calls (non-blocking). Two simultaneous
authForRepocalls for the same uncached(owner, repo)will both enter stage 1 and issue two discovery requests. This is harmless (idempotent) but has no coverage.
Risk assessment: MED
- Blast radius: the blocking bug is latent until an installation token mint fails post-cache-warm. In the steady state it never triggers. But key rotation — the documented operational procedure — will trigger it on any
authForRepocall made before a restart. - Security exposure: LOW. The credential-scrubbing and
_FILEconvention are well-implemented. The mitmproxy allowlist already coversapi.github.com. No new attack surface. - Regression risk: LOW for existing gateway functionality (App client is not wired into any command yet). HIGH for the first PR that wires it.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26350117112 |
| Cache | hit |
| Session | ses_1a8191f06ffezcPjUPz0AvMKiT |
51e0ade to
4ce0fa1
Compare
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: PASS
All blocking issues and missing tests from the previous review have been addressed in the updated commit (4ce0fa1).
Blocking issues
None. The stage-2 cache-poisoning bug is fixed: installationCache.delete(cacheKey(owner, repo)) is called before returning err(new AuthError(...)) in the mint failure path.
Non-blocking concerns
None. The three non-blocking items from the previous review were also addressed:
AppClientAuthResult.tokennow has a JSDoc warning callers never to log or persist it.verifyPermissionshas a comment explaining the over-privileged WARN fires only on cache-miss (intentional).createAppAuthwas promoted from a dynamic import to a static top-level import (cleaner).
Missing tests
None. Both missing tests are now present:
cache eviction: stage-2 mint failure with warm cache evicts entry so next call re-discovers— covers the full warm-cache → mint-failure → eviction → re-discovery cycle.concurrent discovery: two simultaneous authForRepo calls for same uncached pair both succeed— covers the parallel path.
Risk assessment (LOW): likelihood of regression, security exposure, or blast radius
LOW across the board. The App client is not wired into any command yet (deferred to PR D), so there is no regression surface in the current gateway. The credential-handling and cache-eviction logic are now well-covered by the test suite. No new attack surface.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26350320009 |
| Cache | hit |
| Session | ses_1a8191f06ffezcPjUPz0AvMKiT |
Two-stage auth flow (JWT discovery → installation token) with permission verification, install-URL surfacing on AppNotInstalledError, in-memory installationId cache with invalidation, and error scrubbing for PEM/JWT/token leaks. _FILE convention for credentials with restart-required key rotation documented in README. 18 new tests.
4ce0fa1 to
3362497
Compare
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: PASS
This is a rebase of the previously approved commit (4ce0fa1) onto an updated base. The implementation is identical in all substantive respects.
Blocking issues
None.
Non-blocking concerns
None. The two trivial changes introduced by the rebase are both improvements:
app-client.ts: logger interface properties (warn,debug) gainedreadonlymodifiers — stricter, correct.app-client.test.ts: early-return guards changed fromif (!result.success)/if (result.success)to explicit=== false/=== truecomparisons — consistent with the rest of the test suite.
All items from the prior review cycle (blocking bug, token JSDoc, over-privileged WARN comment, stage-2 cache-eviction test, concurrent discovery test) remain addressed.
Missing tests
None.
Risk assessment (LOW): likelihood of regression, security exposure, or blast radius
Identical to the previous assessment. App client is not wired into any command yet; no regression surface in the current gateway; no new attack surface.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26350640692 |
| Cache | hit |
| Session | ses_1a8191f06ffezcPjUPz0AvMKiT |
Adds GitHub App authentication to the gateway. Second PR in the Unit 5 series after the bindings store (#672).
Two-stage auth flow:
createAppAuth({appId, privateKey})— used once perauthForRepocall to look up the installation for a given(owner, repo).createAppAuth({appId, privateKey, installationId})— returned to the caller, used for the actual repo API calls.The discovery step (GET
/repos/{owner}/{repo}/installation) is new code in this repo — the existing action-tier client only supports the second stage. The gateway needs the first stage because the install URL is the operator-facing recovery path when the App isn't installed on the target repo.Permission verification at discovery time:
contents: read(future units will add more).contents: write, etc.) → succeeds with a WARN log. Operators may have other tools sharing the App.contents→ returnsInsufficientPermissionsErrorwith the App permissions URL.Caching:
(owner, repo) → installationIdcached in-memory for the gateway lifetime.invalidateCache(owner, repo)evicts on auth failure (consumers will call this on 401/404).Credential handling (security-critical):
_FILEconvention viareadSecret('GITHUB_APP_ID')andreadMultilineSecret('GITHUB_APP_PRIVATE_KEY'). Private keys never appear in env vars.secrets/github-app-private-key— bind-mounted files aren't reloaded by the running process. Documented in the README.safeErrorMessage()scrubs PEM blocks and JWT-shaped strings from any error message that surfaces.Compose changes:
secrets/github-app-id,secrets/github-app-private-key. Samecreate_host_path: falseposture as the existing optional secrets (PR fix(gateway): harden deploy contract for infra-as-code consumers #649 / fix(gateway): plumb DISCORD_PRIVILEGED_INTENTS through compose #652). Missing source files produce a cleardocker compose uperror._FILEenv vars in the gateway service block.README changes:
Test coverage: 18 new tests covering all happy paths, under/over-privileged installations, missing credentials, network failures, error scrubbing, and cache invalidation. 197 gateway tests total (was 179).
Out of scope: nothing in this PR uses the App client yet. PR C (workspace-agent scaffold) is independent and ships next; PR D wires the slash command and consumes this client.