| 1 | +--- |
| 2 | +name: debug-mcp-deploy |
| 3 | +description: Debug MCP server deployment issues on staging and production. Use when investigating pod failures, auth errors, JWKS verification failures, token mismatches, Helm deploy issues, or environment config problems. Triggers on debug deploy, staging error, pod logs, auth failure, JWKS, 401. |
| 4 | +--- |
| 5 | + |
| 6 | +# Debugging MCP Server Deployments |
| 7 | + |
| 8 | +Commands below use `<staging-namespace>` as a placeholder for both the staging namespace and the Helm release name, which are identical. Resolve it from `values.staging.yaml` and the deploy workflow.
| 9 | + |
| 10 | +## Deployment Architecture |
| 11 | + |
| 12 | +The MCP server uses a Helm chart with layered values and SOPS-encrypted secrets. |
| 13 | + |
| 14 | +### Environments |
| 15 | + |
| 16 | +Production and staging namespaces, hosts, and Redis DBs are defined in `values.yaml` and `values.staging.yaml`. Check those files for current values. |
| 17 | + |
| 18 | +Both environments hit the **same production EveryRow API** (there is no staging API). This means Supabase JWTs must be verifiable by the production API. |
| 19 | + |
| 20 | +### Values Layering |
| 21 | + |
| 22 | +``` |
| 23 | +# Later -f files override earlier ones: base config, then staging overrides
| 24 | +# (MCP_SERVER_URL, REDIS_DB, host), then secrets decrypted from secrets.staging.enc.yaml.
| 25 | +helm upgrade <staging-namespace> . \
| 26 | + -f values.yaml -f values.staging.yaml -f values.secrets.staging.yaml
| 27 | +``` |
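
As a rough illustration of the layering, the sketch below mimics how a later values file overrides an earlier one with a recursive dictionary merge. It is not Helm's implementation. The `env` key and all values shown are made-up placeholders, not the chart's real settings.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-ins for the three layers (keys and values are illustrative only).
base = {"env": {"REDIS_DB": "0", "MCP_SERVER_URL": "https://mcp.example.com"}}
staging = {"env": {"REDIS_DB": "1", "MCP_SERVER_URL": "https://mcp-staging.example.com"}}
secrets = {"secrets": {"data": {"SUPABASE_URL": "https://example.supabase.co"}}}

# Apply the layers in the same order as the -f flags.
effective = merge_values(merge_values(base, staging), secrets)
print(effective["env"]["REDIS_DB"])
```

When a pod shows an unexpected value, tracing it through the layers in this order usually reveals which file set it last.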
| 28 | + |
| 29 | +### Key Files |
| 30 | + |
| 31 | +| File | Purpose | |
| 32 | +|---|---| |
| 33 | +| `everyrow-mcp/deploy/chart/values.yaml` | Base Helm values | |
| 34 | +| `everyrow-mcp/deploy/chart/values.staging.yaml` | Staging overrides | |
| 35 | +| `everyrow-mcp/deploy/chart/secrets.enc.yaml` | Production secrets (SOPS) | |
| 36 | +| `everyrow-mcp/deploy/chart/secrets.staging.enc.yaml` | Staging secrets (SOPS) | |
| 37 | +| `.github/workflows/deploy-mcp.yaml` | CI/CD workflow | |
| 38 | + |
| 39 | +## Debugging Workflow |
| 40 | + |
| 41 | +### Step 1: Check pod status and logs |
| 42 | + |
| 43 | +```bash |
| 44 | +# Pod status |
| 45 | +kubectl get pods -n <staging-namespace> -o wide |
| 46 | + |
| 47 | +# Recent logs (INFO level by default) |
| 48 | +kubectl logs -n <staging-namespace> -l app=<staging-namespace> --tail=200 |
| 49 | + |
| 50 | +# Filter for errors |
| 51 | +kubectl logs -n <staging-namespace> -l app=<staging-namespace> --tail=500 | grep -iE "error|warn|401|500|fail" |
| 52 | +``` |
| 53 | + |
| 54 | +Note: `verify_token` logs JWT failures at DEBUG level only. If you suspect auth issues but see no errors at the default INFO level, token verification is probably failing silently and returning `None`.
| 55 | + |
| 56 | +### Step 2: Inspect pod environment |
| 57 | + |
| 58 | +```bash |
| 59 | +# Check all relevant env vars |
| 60 | +kubectl exec -n <staging-namespace> deploy/<staging-namespace> -- env | grep -iE "SUPABASE|MCP_SERVER|REDIS|EVERYROW_API" |
| 61 | +``` |
| 62 | + |
| 63 | +### Step 3: Compare secrets across environments |
| 64 | + |
| 65 | +```bash |
| 66 | +# Decrypt and compare |
| 67 | +sops -d everyrow-mcp/deploy/chart/secrets.enc.yaml # Production |
| 68 | +sops -d everyrow-mcp/deploy/chart/secrets.staging.enc.yaml # Staging |
| 69 | +``` |
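
Reading two decrypted files side by side is error-prone and puts secret values on screen. The sketch below reports only the key names that differ. It assumes the `secrets.data` layout implied by the `sops --set` path used in this document and that `sops -d --output-type json` is available. `load_secret_data` and `diff_secret_keys` are hypothetical helper names, and the demo values are made up.

```python
import json
import subprocess


def load_secret_data(path: str) -> dict:
    """Decrypt a SOPS file as JSON and return its secrets.data mapping (assumed layout)."""
    result = subprocess.run(
        ["sops", "-d", "--output-type", "json", path],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["secrets"]["data"]


def diff_secret_keys(prod: dict, staging: dict) -> dict:
    """Report which keys are missing or differ, without exposing their values."""
    prod_keys, staging_keys = set(prod), set(staging)
    return {
        "only_in_prod": sorted(prod_keys - staging_keys),
        "only_in_staging": sorted(staging_keys - prod_keys),
        "different": sorted(k for k in prod_keys & staging_keys if prod[k] != staging[k]),
    }


# Demo with made-up values; in practice pass the results of load_secret_data().
prod = {"SUPABASE_URL": "https://aaa.supabase.co", "SUPABASE_ANON_KEY": "k1"}
staging = {"SUPABASE_URL": "https://bbb.supabase.co", "SUPABASE_ANON_KEY": "k1", "EXTRA": "x"}
report = diff_secret_keys(prod, staging)
print(report)
```

A `SUPABASE_URL` entry under `different` is the signal to look at first, given that both environments call the same EveryRow API.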
| 70 | + |
| 71 | +### Step 4: Update a secret value |
| 72 | + |
| 73 | +```bash |
| 74 | +sops --set '["secrets"]["data"]["KEY_NAME"] "new-value"' everyrow-mcp/deploy/chart/secrets.staging.enc.yaml |
| 75 | +``` |
| 76 | + |
| 77 | +## Auth Flow Debugging |
| 78 | + |
| 79 | +The MCP server uses a 3-leg OAuth flow: Google → Supabase → MCP Server → Claude. |
| 80 | + |
| 81 | +### Token verification path |
| 82 | + |
| 83 | +1. Client sends Bearer token (a Supabase JWT) with each `/mcp` request |
| 84 | +2. `SupabaseTokenVerifier.verify_token()` fetches JWKS from `{SUPABASE_URL}/auth/v1/.well-known/jwks.json` |
| 85 | +3. Finds signing key by matching JWT header `kid` against JWKS keys |
| 86 | +4. Decodes JWT with issuer=`{SUPABASE_URL}/auth/v1`, audience=`authenticated` |
| 87 | +5. If valid, the JWT is also used as Bearer token for EveryRow API calls (`app.py:_http_client_factory`) |
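
Step 3 of the path above can be sketched with the standard library alone. This is an illustration of `kid` matching, not the code in `SupabaseTokenVerifier`. It does not verify the signature, issuer, or audience. The helper names and the demo token are made up.

```python
import base64
import json
from typing import Optional


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs omit."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def find_signing_key(token: str, jwks: dict) -> Optional[dict]:
    """Return the JWKS entry whose `kid` matches the token header, else None."""
    header = json.loads(b64url_decode(token.split(".")[0]))
    kid = header.get("kid")
    return next((key for key in jwks.get("keys", []) if key.get("kid") == kid), None)


def b64url_encode(obj: dict) -> str:
    """Build an unsigned JWT segment for the demo below."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()


# Made-up token and key sets; the signature segment is a dummy string.
header = b64url_encode({"alg": "ES256", "kid": "key-1"})
payload = b64url_encode({"sub": "user"})
token = f"{header}.{payload}.sig"

matching = find_signing_key(token, {"keys": [{"kid": "key-1", "kty": "EC"}]})
missing = find_signing_key(token, {"keys": [{"kid": "key-2", "kty": "EC"}]})
print(matching, missing)
```

A `None` result against the JWKS fetched from inside the pod corresponds to the "No matching signing key found in JWKS" error described under Common Issues.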
| 88 | + |
| 89 | +### Test JWKS endpoint from inside the pod |
| 90 | + |
| 91 | +```bash |
| 92 | +kubectl exec -n <staging-namespace> deploy/<staging-namespace> -- python3 -c " |
| 93 | +import httpx, json, os |
| 94 | +url = os.environ['SUPABASE_URL'] |
| 95 | +r = httpx.get(f'{url}/auth/v1/.well-known/jwks.json') |
| 96 | +print(json.dumps(r.json(), indent=2)) |
| 97 | +" |
| 98 | +``` |
| 99 | + |
| 100 | +### Test token verification end-to-end |
| 101 | + |
| 102 | +```bash |
| 103 | +kubectl exec -n <staging-namespace> deploy/<staging-namespace> -- python3 -c " |
| 104 | +import httpx, os |
| 105 | +api = os.environ.get('EVERYROW_API_URL', 'https://everyrow.io/api/v0') |
| 106 | +r = httpx.post(f'{api}/sessions', |
| 107 | + headers={'Authorization': 'Bearer fake.test.token', 'Content-Type': 'application/json'}, |
| 108 | + json={}, timeout=10) |
| 109 | +print(f'Status: {r.status_code}') |
| 110 | +print(f'Body: {r.text[:300]}') |
| 111 | +" |
| 112 | +``` |
| 113 | + |
| 114 | +## Common Issues |
| 115 | + |
| 116 | +### "No matching signing key found in JWKS" |
| 117 | + |
| 118 | +**Symptom**: Tool calls fail with 401: `{"detail":"Error verifying token: No matching signing key found in JWKS"}` |
| 119 | + |
| 120 | +**Root cause**: Supabase project mismatch. The MCP server authenticates via one Supabase project, but the EveryRow API verifies tokens against a different project's JWKS, so the JWT `kid` matches no key there.
| 121 | + |
| 122 | +**Diagnosis**: |
| 123 | +1. Check which Supabase project staging uses: `sops -d everyrow-mcp/deploy/chart/secrets.staging.enc.yaml | grep SUPABASE_URL`
| 124 | +2. Check which project production uses: `sops -d everyrow-mcp/deploy/chart/secrets.enc.yaml | grep SUPABASE_URL`
| 125 | +3. If they differ, staging JWTs won't be accepted by the production EveryRow API |
| 126 | + |
| 127 | +**Fix**: Staging must use the same Supabase project as production (since both hit the same EveryRow API): |
| 128 | +```bash |
| 129 | +sops --set '["secrets"]["data"]["SUPABASE_URL"] "https://<prod-project>.supabase.co"' everyrow-mcp/deploy/chart/secrets.staging.enc.yaml
| 130 | +sops --set '["secrets"]["data"]["SUPABASE_ANON_KEY"] "<prod-anon-key>"' everyrow-mcp/deploy/chart/secrets.staging.enc.yaml
| 131 | +``` |
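
To confirm a project mismatch from an actual token, one option is to read its `iss` claim and compare it with the expected issuer `{SUPABASE_URL}/auth/v1`. The sketch below decodes the payload without verifying the signature, so it is suitable for inspection only. The function names, project names, and demo token are made up.

```python
import base64
import json


def unverified_claims(token: str) -> dict:
    """Decode the JWT payload WITHOUT verifying the signature (inspection only)."""
    segment = token.split(".")[1]
    return json.loads(base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4)))


def issuer_matches(token: str, supabase_url: str) -> bool:
    """True if the token's `iss` equals `{SUPABASE_URL}/auth/v1`."""
    return unverified_claims(token).get("iss") == supabase_url.rstrip("/") + "/auth/v1"


# Made-up token for a hypothetical project "aaa"; header and signature are dummies.
claims = {"iss": "https://aaa.supabase.co/auth/v1", "aud": "authenticated"}
body = base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode()
token = f"e30.{body}.sig"

print(issuer_matches(token, "https://aaa.supabase.co"))
print(issuer_matches(token, "https://bbb.supabase.co"))
```

If the issuer matches the staging `SUPABASE_URL` but not the production one, the token was minted by a project the production EveryRow API does not trust.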
| 132 | + |
| 133 | +### Silent 401s with no error in logs |
| 134 | + |
| 135 | +**Symptom**: `/mcp` requests return 401 but no error appears in pod logs. |
| 136 | + |
| 137 | +**Root cause**: `SupabaseTokenVerifier.verify_token()` catches all `PyJWTError` exceptions and logs at DEBUG level only, then returns `None`. The MCP SDK middleware sees `None` and returns 401. |
| 138 | + |
| 139 | +**Diagnosis**: Exec into the pod and test token verification manually, or temporarily set the log level to DEBUG so the failures appear in the logs.
| 140 | + |
| 141 | +### Token works initially then stops |
| 142 | + |
| 143 | +**Symptom**: Tools work right after OAuth, then start returning 401 after ~1 hour. |
| 144 | + |
| 145 | +**Root cause**: Supabase JWTs expire after 1 hour by default. The MCP client should use its refresh token to obtain a new JWT, so check whether the refresh flow works.
| 146 | + |
| 147 | +**Diagnosis**: Look for `grant_type=refresh_token` requests in logs and check their status codes. |
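
If a failing token can be captured, checking its `exp` claim quickly tells expiry apart from other causes. The sketch below reads the claim without verifying the signature. `seconds_until_expiry` is a hypothetical helper and the demo token is made up.

```python
import base64
import json
import time
from typing import Optional


def seconds_until_expiry(token: str, now: Optional[float] = None) -> float:
    """Read the unverified `exp` claim and return seconds left (negative if expired)."""
    segment = token.split(".")[1]
    claims = json.loads(base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4)))
    return claims["exp"] - (time.time() if now is None else now)


# Made-up token that expires at Unix time 4600; header and signature are dummies.
body = base64.urlsafe_b64encode(json.dumps({"exp": 4600}).encode()).rstrip(b"=").decode()
token = f"e30.{body}.sig"

print(seconds_until_expiry(token, now=1000))  # still valid
print(seconds_until_expiry(token, now=5000))  # already expired
```

A negative result combined with no `grant_type=refresh_token` requests in the logs suggests the client never attempted a refresh.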
| 148 | + |
| 149 | +### Pod can't reach Supabase JWKS |
| 150 | + |
| 151 | +**Symptom**: JWKS fetch times out (logged as "JWKS fetch timed out (10s)"). |
| 152 | + |
| 153 | +**Diagnosis**: Test network from inside the pod: |
| 154 | +```bash |
| 155 | +kubectl exec -n <staging-namespace> deploy/<staging-namespace> -- python3 -c " |
| 156 | +import httpx, os |
| 157 | +r = httpx.get(f'{os.environ[\"SUPABASE_URL\"]}/auth/v1/.well-known/jwks.json', timeout=10) |
| 158 | +print(r.status_code, r.text[:200]) |
| 159 | +" |
| 160 | +``` |
| 161 | + |
| 162 | +## Deploying a Fix |
| 163 | + |
| 164 | +After updating secrets or config: |
| 165 | + |
| 166 | +1. Commit and push to the branch |
| 167 | +2. Trigger the deploy workflow: |
| 168 | + ```bash |
| 169 | + gh workflow run deploy-mcp.yaml -f branch=<branch> -f deploy_staging=true |
| 170 | + ``` |
| 171 | +3. Monitor the deployment: |
| 172 | + ```bash |
| 173 | + kubectl rollout status deploy/<staging-namespace> -n <staging-namespace> --timeout=5m |
| 174 | + ``` |
| 175 | +4. Verify the fix by checking logs for successful auth flows |
| 176 | + |
| 177 | +## Gotchas |
| 178 | + |
| 179 | +- **No staging EveryRow API**: `values.staging.yaml` intentionally does NOT override `EVERYROW_API_URL`. Both environments use production. |
| 180 | +- **SOPS KMS access**: Decryption requires GCP IAM permissions on the KMS key referenced in the `.sops` metadata of each encrypted file. Run `gcloud auth application-default login` if it fails. |
| 181 | +- **Redis DB isolation**: Staging and production use different Redis DB indices (see `values.yaml` / `values.staging.yaml`). They share the same Redis Sentinel cluster. |
| 182 | +- **`_UNSAFE_decode_server_jwt`**: Hardcodes `algorithms=["RS256"]`, but it is used only for trusted server-to-server token inspection with signature verification disabled. It does NOT affect client-facing token verification.