diff --git a/.wip/tasks/draft-protect-quote-wire-crawlers.md b/.wip/tasks/draft-protect-quote-wire-crawlers.md
new file mode 100644
index 0000000..ed505d6
--- /dev/null
+++ b/.wip/tasks/draft-protect-quote-wire-crawlers.md
@@ -0,0 +1,41 @@
+---
+status: draft
+---
+
+# Protect the quote wire from crawler loops
+
+## Problem
+
+The Elsewhere quote wire creates procedural journey URLs with query parameters. After
+release, ClaudeBot and OpenAI crawlers followed the quote wire deeply enough to burn
+through the Cloudflare Workers allotment.
+
+The site should still expose the quote index and clean quote pages for normal discovery,
+but crawlers should not be invited into the infinite-looking journey space.
+
+## Proposed solution
+
+Keep the public quote index crawlable while making the procedural journey URLs
+crawler-hostile and budget-safe. Clean archive and quote URLs should remain available, but
+query-state journey URLs should not be indexed or followed.
+
+Start with a repo-level fix before adding Cloudflare custom rules. Robots directives
+should explicitly keep crawlers out of `/elsewhere/quotes/*?*`, with specific handling for
+ClaudeBot and OpenAI crawlers where useful. Journey links should signal `nofollow`, and
+journey pages should keep canonical URLs pointed at their clean quote path.
+
+Cloudflare-side protection should remain a fallback if robots directives and link hints do
+not reduce abusive crawler traffic enough.
+
+## Requirements
+
+- `/elsewhere/quotes`, `/elsewhere/quotes/archive`, and clean `/elsewhere/quotes/{slug}`
+  pages should remain accessible.
+- Procedural journey URLs with query parameters should be disallowed for crawlers.
+- Journey continuation links should use `rel="nofollow"`.
+- Journey pages should remain `noindex, nofollow` and canonicalize to the clean quote URL.
+- ClaudeBot and OpenAI crawler traffic should be handled explicitly in robots directives.
+- Do not add Cloudflare custom rules unless the repo-level crawl controls prove
+  insufficient.
+- The first fix may rely on crawler politeness, but it should leave a clear escalation
+  path for Worker budget protection.
diff --git a/src/elsewhere/quotes/QuotePage.astro b/src/elsewhere/quotes/QuotePage.astro
index f46e665..ab7056f 100644
--- a/src/elsewhere/quotes/QuotePage.astro
+++ b/src/elsewhere/quotes/QuotePage.astro
@@ -70,6 +70,7 @@ const { Content } = quote;
 {doorWords[index]}
diff --git a/src/pages/robots.txt.ts b/src/pages/robots.txt.ts
index 16ef4c7..8c9ed2b 100644
--- a/src/pages/robots.txt.ts
+++ b/src/pages/robots.txt.ts
@@ -1,7 +1,26 @@
 import { env } from "cloudflare:workers";
 
-const productionRobots = `User-agent: *
+const quoteJourneyDisallow = "Disallow: /elsewhere/quotes/*?*";
+
+const productionRobots = `User-agent: ClaudeBot
+Allow: /
+${quoteJourneyDisallow}
+
+User-agent: GPTBot
+Allow: /
+${quoteJourneyDisallow}
+
+User-agent: ChatGPT-User
+Allow: /
+${quoteJourneyDisallow}
+
+User-agent: OAI-SearchBot
+Allow: /
+${quoteJourneyDisallow}
+
+User-agent: *
 Allow: /
+${quoteJourneyDisallow}
 
 Sitemap: https://johnhooks.io/sitemap.xml
 `;
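
Reviewer sketch (not part of the patch): to sanity-check which URLs the `Disallow: /elsewhere/quotes/*?*` rule is meant to catch, the wildcard pattern can be approximated with a small matcher. This assumes RFC 9309 style semantics, where `*` matches any run of characters and a trailing `$` anchors the end. The helper names and the `door` query parameter are hypothetical, and real crawlers may also normalize percent-encoding and apply longest-match precedence between `Allow` and `Disallow`, which this sketch ignores.

```typescript
// Hypothetical helper, not part of the diff: converts one robots.txt path
// pattern into a RegExp. "*" becomes ".*"; a trailing "$" anchors the end.
function robotsPatternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split("*")
    // Escape regex metacharacters so "?" and "." are matched literally.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

// True when a path (with its query string) matches the journey Disallow rule.
function isJourneyUrlDisallowed(pathAndQuery: string): boolean {
  return robotsPatternToRegExp("/elsewhere/quotes/*?*").test(pathAndQuery);
}

// The "door" parameter below is an assumed example of journey query state.
console.log(isJourneyUrlDisallowed("/elsewhere/quotes/some-slug?door=2")); // true
console.log(isJourneyUrlDisallowed("/elsewhere/quotes/some-slug")); // false
console.log(isJourneyUrlDisallowed("/elsewhere/quotes/archive")); // false
console.log(isJourneyUrlDisallowed("/elsewhere/quotes")); // false
```

Under these assumptions the clean index, archive, and slug pages stay crawlable, while any slug URL carrying a query string is excluded. Note that a query string on the index itself (for example `/elsewhere/quotes?x=1`) would not match, because the pattern requires the trailing slash after `quotes`.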