Use robots.txt and mitigate AI bots #460

Open
cycomachead wants to merge 6 commits into main from cycomachead/ai/26/1

@cycomachead (Member) commented May 23, 2026

Update nginx config to reduce outbound traffic

  • robots.txt was present but was never actually served; serve it now and throttle bots slightly.
  • Specifically block spam-type bots from querying Snap!
  • Add nginx rate limits to a couple of high-traffic endpoints, namely fetching the project XML.

Changes

  • html/robots.txt — Replaces the old per-bot rules with a policy that blocks /project/ and /api/ for all crawlers while explicitly allowing the upcoming /project/*/users/* viewer route; adds Crawl-delay: 10 and a Sitemap: line (confirm with owner whether a real sitemap exists before merging)

  • nginx.conf.d/snap-bot-mitigation.conf (new) — http-context map that flags 8 SEO bots known to ignore robots.txt (ahrefsbot, semrushbot, dotbot, mj12bot, sleepbot, yeti, blexbot, petalbot); limit_req_zone for /project/<id> at 60 r/s; limit_req_status 429

  • nginx.conf.d/snap-bot-mitigation-server.conf (new) — server-context directives: location = /robots.txt serving the static file from the repo (no app dependency), and if ($snap_block_ua) { return 403; }

  • nginx.conf.d/locations.conf — Includes the server-context snippet (applies to both prod hosts, both staging hosts, and dev via the shared locations.conf); adds a location ~ ^/project/[0-9]+/?$ block with limit_req zone=snap_project burst=120 nodelay targeting the raw XML endpoint — the single largest egress source

  • nginx.conf — Includes snap-bot-mitigation.conf in http context
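
For orientation, the rate-limit zone and the server-context snippet described above might look roughly like this. It is a sketch assembled from the description, not the PR's actual files: the zone key (`$binary_remote_addr`), the `10m` zone size, and the robots.txt filesystem path are assumptions, and the `map` that defines `$snap_block_ua` is left out.

```nginx
# --- snap-bot-mitigation.conf (included in http context), sketch only ---

# Assumption: the limit is keyed per client IP and uses a 10 MB shared zone.
# The description gives only the zone name (snap_project) and the 60 r/s rate.
limit_req_zone $binary_remote_addr zone=snap_project:10m rate=60r/s;

# Reject over-limit requests with 429 instead of nginx's default 503.
limit_req_status 429;

# --- snap-bot-mitigation-server.conf (included in server context), sketch only ---

# Serve robots.txt as a static file so it does not depend on the app.
# Assumption: the path below is a placeholder for the repo checkout.
location = /robots.txt {
    alias /path/to/snapCloud/html/robots.txt;
}

# $snap_block_ua is set by the map in snap-bot-mitigation.conf.
# An empty string or "0" is false; any other value triggers the 403.
if ($snap_block_ua) {
    return 403;
}
```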

Reviewer notes

  • Rate limits are deliberately generous (60 r/s, burst 120) to avoid throttling classroom NAT IPs where 20–40 students share one public IP. After deploy, watch error.log for "limiting requests" lines and tighten only with evidence.
  • Well-behaved bots (Googlebot, bingbot, ChatGPT-User, Amazonbot, etc.) are not in the UA blocklist — robots.txt handles them.
  • The upcoming /project/<name>/users/<username> HTML route does not match the numeric-ID regex and is unaffected by rate limiting.
  • Deploy with nginx -t && systemctl reload nginx (graceful reload, not restart).
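
A sketch of the rate-limited location, to make the matching and burst behaviour in the notes above concrete. The regex and `limit_req` parameters come from the description; the handler inside the block is not shown there, so it is left as a placeholder comment.

```nginx
# locations.conf (server context), sketch only

# Matches /project/123 and /project/123/ only. A path such as
# /project/<name>/users/<username> does not match, because the segment
# after /project/ must be all digits and must end the URI.
location ~ ^/project/[0-9]+/?$ {
    # Sustained rate is 60 r/s per key (set on the snap_project zone).
    # Up to 120 requests above that rate are served immediately (nodelay);
    # requests beyond the burst are rejected with the limit_req_status code.
    limit_req zone=snap_project burst=120 nodelay;

    # Assumption: the route's existing handler directives go here.
    # They are not shown in the PR description.
}
```

Because the limit only applies inside this location, requests to other paths (including /api/v1/project/:id) are not counted against the snap_project zone.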

cycomachead and others added 6 commits May 23, 2026 06:14
- Update robots.txt to disallow /project/ and /api/ for compliant crawlers.
- Add a user-agent blocklist to return 403 for aggressive SEO bots.
- Implement rate limiting (60r/s) specifically for the /project/:id XML
  endpoint to reduce egress costs from automated scraping.
- Serve robots.txt as a static file directly from the repository.

Co-authored-by: Claude Code <noreply@anthropic.com>

…ad/ai/26/1

* 'main' of github.com:snap-cloud/snapCloud:
  docs: move installation and deployment guides to docs directory
  rerun migrations
  feat: enhance compression logging and prevent stale CSS assets
  feat: implement pre-compression and global gzip configuration

Update internal UI and model call sites to use the explicit /api/v1/project/:id
path instead of the bare /project/:id route. This ensures legitimate
application traffic bypasses bot-mitigation rate limits targeting legacy
crawler entry points.

Co-authored-by: Claude Code <noreply@anthropic.com>
@cycomachead changed the title from "Implement 3-layer bot mitigation in Snap! nginx config" to "Use robots.txt and mitigate AI bots" May 23, 2026

# UA blocklist for SEO crawlers that ignore robots.txt. \b word-boundaries
# guard against substring false positives (e.g. "yeti" inside a longer UA).
map $http_user_agent $snap_block_ua {

Member, Author:

These bots were pulled from the user agents in actual request logs over a few days in May.
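
The diff context above shows only the opening line of the map. A minimal sketch of how such a blocklist could be written, using the eight bot names from the PR description, is below. The single-alternation layout and the `default 0` entry are assumptions; the actual file may list one pattern per line.

```nginx
# http context, sketch only: flag user agents of crawlers that ignore robots.txt.
map $http_user_agent $snap_block_ua {
    # Anything not matched below is allowed.
    default 0;

    # "~*" makes the regex case-insensitive. The \b word boundaries keep
    # short names such as "yeti" from matching inside a longer token.
    "~*\b(ahrefsbot|semrushbot|dotbot|mj12bot|sleepbot|yeti|blexbot|petalbot)\b" 1;
}
```

The resulting `$snap_block_ua` variable is what the server-context `if ($snap_block_ua) { return 403; }` check reads.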