Commit 0cd0bc0

RafaelPo and claude committed
Add debug-mcp-deploy skill for diagnosing staging/prod issues
Covers Helm values layering, SOPS secrets inspection, auth flow debugging, JWKS verification, and common failure modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ea8678d commit 0cd0bc0

3 files changed

Lines changed: 184 additions & 2 deletions

.claude-plugin/plugin.json

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 {
   "name": "everyrow",
   "description": "Claude Code plugin for the everyrow SDK - AI-powered data processing utilities for transforming, deduping, merging, ranking, and screening dataframes",
-  "version": "0.4.0",
+  "version": "0.4.1",
   "author": {
     "name": "FutureSearch"
   },
Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
---
name: debug-mcp-deploy
description: Debug MCP server deployment issues on staging and production. Use when investigating pod failures, auth errors, JWKS verification failures, token mismatches, Helm deploy issues, or environment config problems. Triggers on debug deploy, staging error, pod logs, auth failure, JWKS, 401.
---

# Debugging MCP Server Deployments

Commands below use `<staging-namespace>` as a placeholder. Resolve it from `values.staging.yaml` and the deploy workflow (the Helm release name and namespace match).

## Deployment Architecture

The MCP server uses a Helm chart with layered values and SOPS-encrypted secrets.

### Environments

Production and staging namespaces, hosts, and Redis DBs are defined in `values.yaml` and `values.staging.yaml`. Check those files for current values.

Both environments hit the **same production EveryRow API** (there is no staging API). This means Supabase JWTs must be verifiable by the production API.

### Values Layering

```
# Later files override earlier ones:
#   values.yaml                  base config
#   values.staging.yaml          staging overrides (MCP_SERVER_URL, REDIS_DB, host)
#   values.secrets.staging.yaml  decrypted from secrets.staging.enc.yaml
helm upgrade <staging-namespace> . \
  -f values.yaml \
  -f values.staging.yaml \
  -f values.secrets.staging.yaml
```

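As a rough illustration of how the layering resolves: Helm merges the `-f` files left to right, nested maps are merged key by key, and a later file wins on conflicts. The sketch below mimics that in plain Python. The keys and values (`env`, `LOG_LEVEL`, the hostnames) are made up for illustration and are not the chart's real contents, and the sketch ignores Helm's special handling of `null`.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not combined.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical values, for illustration only.
base = {"env": {"REDIS_DB": "0", "LOG_LEVEL": "INFO"}, "host": "prod.example.com"}
staging = {"env": {"REDIS_DB": "1"}, "host": "staging.example.com"}
secrets = {"secrets": {"data": {"SUPABASE_URL": "https://example.supabase.co"}}}

effective = base
for layer in (staging, secrets):
    effective = merge_values(effective, layer)

print(effective["env"])   # REDIS_DB comes from staging, LOG_LEVEL from base
print(effective["host"])  # overridden by the staging layer
```

When a staging pod shows an unexpected value, walking the layers in this order usually shows which file it came from.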
### Key Files

| File | Purpose |
|---|---|
| `everyrow-mcp/deploy/chart/values.yaml` | Base Helm values |
| `everyrow-mcp/deploy/chart/values.staging.yaml` | Staging overrides |
| `everyrow-mcp/deploy/chart/secrets.enc.yaml` | Production secrets (SOPS) |
| `everyrow-mcp/deploy/chart/secrets.staging.enc.yaml` | Staging secrets (SOPS) |
| `.github/workflows/deploy-mcp.yaml` | CI/CD workflow |

## Debugging Workflow

### Step 1: Check pod status and logs

```bash
# Pod status
kubectl get pods -n <staging-namespace> -o wide

# Recent logs (INFO level by default)
kubectl logs -n <staging-namespace> -l app=<staging-namespace> --tail=200

# Filter for errors
kubectl logs -n <staging-namespace> -l app=<staging-namespace> --tail=500 | grep -iE "error|warn|401|500|fail"
```

Note: `verify_token` logs JWT failures at DEBUG level only. If you suspect auth issues but see no errors at the default INFO level, token verification is probably returning `None` silently.

### Step 2: Inspect pod environment

```bash
# Check all relevant env vars
kubectl exec -n <staging-namespace> deploy/<staging-namespace> -- env | grep -iE "SUPABASE|MCP_SERVER|REDIS|EVERYROW_API"
```

### Step 3: Compare secrets across environments

```bash
# Decrypt and compare
sops -d everyrow-mcp/deploy/chart/secrets.enc.yaml          # Production
sops -d everyrow-mcp/deploy/chart/secrets.staging.enc.yaml  # Staging
```

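To avoid printing secret values to the terminal while comparing, one option is to diff only the key names and report which values differ. This is a minimal sketch, assuming both files were decrypted to JSON (for example with `sops -d --output-type json`) and that the values live under `secrets.data`, as the `--set` path in Step 4 suggests. The helper name `diff_secrets` and the sample documents are hypothetical.

```python
import json


def diff_secrets(prod: dict, staging: dict) -> dict:
    """Compare two flat secret maps and report key names only, never values."""
    prod_keys, staging_keys = set(prod), set(staging)
    shared = prod_keys & staging_keys
    return {
        "only_in_prod": sorted(prod_keys - staging_keys),
        "only_in_staging": sorted(staging_keys - prod_keys),
        "different": sorted(k for k in shared if prod[k] != staging[k]),
    }


# Stand-ins for the decrypted JSON documents; the values are placeholders.
prod_doc = json.loads(
    '{"secrets": {"data": {"SUPABASE_URL": "https://a.example", "SUPABASE_ANON_KEY": "k1"}}}'
)
staging_doc = json.loads(
    '{"secrets": {"data": {"SUPABASE_URL": "https://b.example", "SUPABASE_ANON_KEY": "k1", "EXTRA": "x"}}}'
)

report = diff_secrets(prod_doc["secrets"]["data"], staging_doc["secrets"]["data"])
print(json.dumps(report, indent=2))
```

A `SUPABASE_URL` entry under `different` is the signal to look at first, since it points at the project mismatch described under Common Issues.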
### Step 4: Update a secret value

```bash
sops --set '["secrets"]["data"]["KEY_NAME"] "new-value"' everyrow-mcp/deploy/chart/secrets.staging.enc.yaml
```

## Auth Flow Debugging

The MCP server uses a 3-leg OAuth flow: Google → Supabase → MCP Server → Claude.

### Token verification path

1. Client sends Bearer token (a Supabase JWT) with each `/mcp` request
2. `SupabaseTokenVerifier.verify_token()` fetches JWKS from `{SUPABASE_URL}/auth/v1/.well-known/jwks.json`
3. Finds signing key by matching JWT header `kid` against JWKS keys
4. Decodes JWT with issuer=`{SUPABASE_URL}/auth/v1`, audience=`authenticated`
5. If valid, the JWT is also used as Bearer token for EveryRow API calls (`app.py:_http_client_factory`)

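The `kid` lookup in step 3 can be reproduced offline to see why a token is rejected. The sketch below uses only the standard library: it reads the JWT header without verifying the signature and checks whether its `kid` appears in a JWKS document. The function names, the sample token, and the sample JWKS are illustrative and are not taken from `SupabaseTokenVerifier`.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def jwt_header(token: str) -> dict:
    """Return the decoded JWT header. No signature check is performed."""
    return json.loads(b64url_decode(token.split(".")[0]))


def find_signing_key(token: str, jwks: dict):
    """Return the JWKS entry whose kid matches the token header, or None."""
    kid = jwt_header(token).get("kid")
    return next((k for k in jwks.get("keys", []) if k.get("kid") == kid), None)


def b64url_encode(obj: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment (test helper)."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()


# Fabricated token and JWKS documents, for illustration only.
token = ".".join(
    [b64url_encode({"alg": "ES256", "kid": "key-a"}), b64url_encode({"sub": "u1"}), "sig"]
)
jwks_same_project = {"keys": [{"kid": "key-a", "kty": "EC"}]}
jwks_other_project = {"keys": [{"kid": "key-b", "kty": "EC"}]}

print(find_signing_key(token, jwks_same_project) is not None)   # True
print(find_signing_key(token, jwks_other_project) is not None)  # False
```

A `None` result against the JWKS the verifying side actually fetches corresponds to the "No matching signing key found in JWKS" error covered under Common Issues.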
### Test JWKS endpoint from inside the pod

```bash
kubectl exec -n <staging-namespace> deploy/<staging-namespace> -- python3 -c "
import httpx, json, os
url = os.environ['SUPABASE_URL']
r = httpx.get(f'{url}/auth/v1/.well-known/jwks.json')
print(json.dumps(r.json(), indent=2))
"
```

### Test token verification end-to-end

```bash
# Sends a deliberately invalid token; expect the API to reject it (likely a 401)
kubectl exec -n <staging-namespace> deploy/<staging-namespace> -- python3 -c "
import httpx, os
api = os.environ.get('EVERYROW_API_URL', 'https://everyrow.io/api/v0')
r = httpx.post(f'{api}/sessions',
    headers={'Authorization': 'Bearer fake.test.token', 'Content-Type': 'application/json'},
    json={}, timeout=10)
print(f'Status: {r.status_code}')
print(f'Body: {r.text[:300]}')
"
```

## Common Issues

### "No matching signing key found in JWKS"

**Symptom**: Tool calls fail with 401: `{"detail":"Error verifying token: No matching signing key found in JWKS"}`

**Root cause**: Supabase project mismatch. The MCP server authenticates via one Supabase project but the EveryRow API verifies tokens against a different project's JWKS, so the JWT `kid` matches none of the keys the API knows about.

**Diagnosis**:
1. Check which Supabase project staging uses: `sops -d everyrow-mcp/deploy/chart/secrets.staging.enc.yaml | grep SUPABASE_URL`
2. Check which project production uses: `sops -d everyrow-mcp/deploy/chart/secrets.enc.yaml | grep SUPABASE_URL`
3. If they differ, staging JWTs won't be accepted by the production EveryRow API

**Fix**: Staging must use the same Supabase project as production (since both hit the same EveryRow API):
```bash
sops --set '["secrets"]["data"]["SUPABASE_URL"] "https://<prod-project>.supabase.co"' everyrow-mcp/deploy/chart/secrets.staging.enc.yaml
sops --set '["secrets"]["data"]["SUPABASE_ANON_KEY"] "<prod-anon-key>"' everyrow-mcp/deploy/chart/secrets.staging.enc.yaml
```

### Silent 401s with no error in logs

**Symptom**: `/mcp` requests return 401 but no error appears in pod logs.

**Root cause**: `SupabaseTokenVerifier.verify_token()` catches all `PyJWTError` exceptions and logs at DEBUG level only, then returns `None`. The MCP SDK middleware sees `None` and returns 401.

**Diagnosis**: Exec into the pod and manually test token verification, or temporarily increase log level.

### Token works initially then stops

**Symptom**: Tools work right after OAuth, then start returning 401 after ~1 hour.

**Root cause**: Supabase JWTs expire after 1 hour by default. The MCP client should use the refresh token to get a new JWT. Check if the refresh flow works.

**Diagnosis**: Look for `grant_type=refresh_token` requests in logs and check their status codes.

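To confirm that expiry is the cause, the `exp` claim can be read straight from the token payload. This is a minimal sketch using only the standard library. It does not verify the signature, so it is suitable for inspection only. `seconds_until_expiry` is a hypothetical helper and the sample token is fabricated.

```python
import base64
import json
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the JWT payload without verifying the signature."""
    payload = token.split(".")[1]
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> float:
    """Return seconds until the exp claim; a negative value means expired."""
    now = time.time() if now is None else now
    return jwt_claims(token)["exp"] - now


def _encode(obj: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment (test helper)."""
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()


# Fabricated token with a one-hour lifetime, for illustration only.
issued_at = 1_700_000_000
token = ".".join([_encode({"alg": "ES256"}), _encode({"exp": issued_at + 3600}), "sig"])

print(seconds_until_expiry(token, now=issued_at + 600))   # 3000: still valid
print(seconds_until_expiry(token, now=issued_at + 4000))  # -400: expired
```

If the value is negative at the time the 401s begin, the token itself is fine and the refresh flow is the place to look.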
### Pod can't reach Supabase JWKS

**Symptom**: JWKS fetch times out (logged as "JWKS fetch timed out (10s)").

**Diagnosis**: Test network from inside the pod:
```bash
kubectl exec -n <staging-namespace> deploy/<staging-namespace> -- python3 -c "
import httpx, os
r = httpx.get(f'{os.environ[\"SUPABASE_URL\"]}/auth/v1/.well-known/jwks.json', timeout=10)
print(r.status_code, r.text[:200])
"
```

## Deploying a Fix

After updating secrets or config:

1. Commit and push to the branch
2. Trigger the deploy workflow:
   ```bash
   gh workflow run deploy-mcp.yaml -f branch=<branch> -f deploy_staging=true
   ```
3. Monitor the deployment:
   ```bash
   kubectl rollout status deploy/<staging-namespace> -n <staging-namespace> --timeout=5m
   ```
4. Verify the fix by checking logs for successful auth flows

## Gotchas

- **No staging EveryRow API**: `values.staging.yaml` intentionally does NOT override `EVERYROW_API_URL`. Both environments use production.
- **SOPS KMS access**: Decryption requires GCP IAM permissions on the KMS key referenced in the `.sops` metadata of each encrypted file. Run `gcloud auth application-default login` if it fails.
- **Redis DB isolation**: Staging and production use different Redis DB indices (see `values.yaml` / `values.staging.yaml`). They share the same Redis Sentinel cluster.
- **`_UNSAFE_decode_server_jwt`**: Hardcodes `algorithms=["RS256"]` but this is only for trusted server-to-server token inspection (signature verification disabled). It does NOT affect client-facing token verification.

.github/workflows/skill-version-check.yaml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ name: Skill Version Check
 on:
   pull_request:
     paths:
-      - "**/skills/**"
+      - "skills/**"

 jobs:
   check-version-bump:
