Skip WAF breaker on recoverable CF challenges; add pacer warm-up#390
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughPR adds two crawler improvements: Cloudflare WAF now treats certain ChangesCloudflare WAF Recoverable Mitigations & Pacer Warmup
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Updates to Preview Branch (work/modest-heisenberg-87c48b) ↗︎
Tasks are run on every commit but only new migration files are pushed.
View logs for this Workflow Run ↗︎. |
Release VersionsApp patch: ChangelogFixed
Added
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/crawler/waf_test.go (1)
114-154: ⚡ Quick winAdd regression cases for remaining recoverable tokens and normalisation.
Please add table rows for
Cf-Mitigated=jschallenge,Cf-Mitigated=rate_limited, and a mixed-case/whitespace variant (for example" Managed_Challenge "). This locks in all branches introduced by the new normalisation/membership logic.Suggested test additions
{ + name: "cloudflare — cf-mitigated=jschallenge on 403 is recoverable", + status: http.StatusForbidden, + headers: http.Header{ + "Cf-Mitigated": []string{"jschallenge"}, + "Server": []string{"cloudflare"}, + }, + body: []byte("challenge page"), + wantBlocked: false, + }, + { + name: "cloudflare — cf-mitigated=rate_limited on 429 is recoverable", + status: http.StatusTooManyRequests, + headers: http.Header{ + "Cf-Mitigated": []string{"rate_limited"}, + "Server": []string{"cloudflare"}, + }, + body: []byte("rate limited"), + wantBlocked: false, + }, + { + name: "cloudflare — cf-mitigated normalisation (case/space) is recoverable", + status: http.StatusForbidden, + headers: http.Header{ + "Cf-Mitigated": []string{" Managed_Challenge "}, + "Server": []string{"cloudflare"}, + }, + body: []byte("checking your browser"), + wantBlocked: false, + }, + { name: "cloudflare — cf-mitigated=block on 403 is a hard block", status: http.StatusForbidden, headers: http.Header{🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/crawler/waf_test.go` around lines 114 - 154, Add three additional table-driven test cases in the same test slice used in waf_test.go (the table entries shown) to cover the remaining recoverable tokens and normalization: one with "Cf-Mitigated" = "jschallenge" and status 403 expecting wantBlocked=false, one with "Cf-Mitigated" = "rate_limited" and status 429 expecting wantBlocked=false, and one with a mixed-case/whitespace variant like " Managed_Challenge " (or "Managed_Challenge" with odd casing/spaces) to assert the normalization logic treats it as recoverable (wantBlocked=false); for the hard-block case style tests, also include appropriate wantVendor=WAFVendorCloudflare and reasonPrefix assertions where applicable to mirror the existing "block" entry pattern so all branches introduced by the normalization/membership checks are exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/crawler/waf_test.go`:
- Around line 114-154: Add three additional table-driven test cases in the same
test slice used in waf_test.go (the table entries shown) to cover the remaining
recoverable tokens and normalization: one with "Cf-Mitigated" = "jschallenge"
and status 403 expecting wantBlocked=false, one with "Cf-Mitigated" =
"rate_limited" and status 429 expecting wantBlocked=false, and one with a
mixed-case/whitespace variant like " Managed_Challenge " (or "Managed_Challenge"
with odd casing/spaces) to assert the normalization logic treats it as
recoverable (wantBlocked=false); for the hard-block case style tests, also
include appropriate wantVendor=WAFVendorCloudflare and reasonPrefix assertions
where applicable to mirror the existing "block" entry pattern so all branches
introduced by the normalization/membership checks are exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 86410889-8e14-4ca1-b59c-bc527a597055
📒 Files selected for processing (4)
CHANGELOG.mdinternal/crawler/waf.gointernal/crawler/waf_test.gointernal/jobs/stream_worker.go
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
🐝 Review App Deployed Homepage: https://hover-pr-390.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-390.fly.dev |
Skip WAF breaker on recoverable CF challenges; add pacer warm-up
Summary
Cf-Mitigatedvalues (challenge,jschallenge,managed_challenge,rate_limited). Onlyblock(and unknown values, conservatively) still trips. The 403/429 status code still drives pacer back-off, so the job stays alive and the host gets slowed down instead of terminated.GNH_PACER_WARMUP_DELAY_MS(default 2000) seedsadaptive_delay_msfor never-crawled domains (DB column is NULL). ActivatesDomainPacer.EffectiveCapfrom the first dispatch so a fresh sitemap can't burst N requests at a CF-fronted host before the pacer has learned anything. Steps down via the existing success path; never re-applies once write-back has stored a learned value.Why
A merrypeople.com run (job
e1b38f68-…) took 42h to complete 178/1221 tasks at ~14m/task. Worker logs from a restart showed the WAF breaker tripping after ~10 minutes withvendor=cloudflare reason="cf-mitigated header present on 403" threshold=2. The Shopify storefront sits behind CF and intermittently challenges Fly's egress IPs — the existing detector treated any non-emptyCf-Mitigatedon non-200 as a hard block, so two consecutive challenges (amongst many successes) terminated the job.Test plan
go test ./...— full suite passes locally.mainpost-merge and verify it runs to completion rather than tripping the breaker.adaptive_delay_ms=2000in Redis (HGET hover:dom:cfg:<domain> adaptive_delay_ms) and converges down via the success path.domains.adaptive_delay_secondsare unchanged on restart.Need help on this PR? Tag
@codesmithwith what you need.Summary by CodeRabbit
Bug Fixes
New Features