Fix dispatcher reliability: capability parsing, crash-on-prom-failure, validation, and persistence #5196
jmguzik wants to merge 1 commit into
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmguzik
…, validation, and persistence

- Unify capability extraction into a single exported ExtractCapabilities function in pkg/dispatcher.
- Change logrus.Fatal to logrus.Error+return when GetJobVolumes fails. A transient Prometheus timeout no longer kills the entire dispatcher process.
- Fix Validate() off-by-one: len(matches) > 1 should be > 0, so a single duplicated job name across groups is now caught. Also call Validate() from LoadConfig() so it runs in production, not only tests.
- Skip Regenerate and downstream side-effects when dispatchJobs returns nil (no build farm clusters configured) instead of wiping all in-memory state. This prevents the HTTP API from either serving stale assignments to removed clusters or returning 404 for every job.
- Persist delta dispatch results to the gob file so assignments survive process restarts.
- Make gob writes atomic via temp file + rename to prevent corruption on crash mid-write.

Signed-off-by: Jakub Guzik <jguzik@redhat.com>
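The Validate() off-by-one described in the commit message can be illustrated with a small sketch. This is not the dispatcher's actual code: the `groups` type, `otherGroupsWith`, and `validate` are hypothetical names, and the sketch assumes `matches` holds the other groups that list the same job name. Under that assumption, a `> 1` check only fires when a job appears in three or more groups, while `> 0` catches a single duplicate.

```go
package main

import (
	"fmt"
	"sort"
)

// groups is a hypothetical stand-in for the dispatcher config:
// group name -> job names assigned to that group.
type groups map[string][]string

// otherGroupsWith returns the names of groups other than self that also list job.
func otherGroupsWith(g groups, self, job string) []string {
	var matches []string
	for name, jobs := range g {
		if name == self {
			continue
		}
		for _, j := range jobs {
			if j == job {
				matches = append(matches, name)
				break
			}
		}
	}
	sort.Strings(matches) // map iteration order is random; sort for stable messages
	return matches
}

// validate returns an error if any job name is listed in more than one group.
func validate(g groups) error {
	names := make([]string, 0, len(g))
	for name := range g {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		for _, job := range g[name] {
			matches := otherGroupsWith(g, name, job)
			// With "len(matches) > 1" a job duplicated in exactly one other
			// group would slip through; "> 0" reports it.
			if len(matches) > 0 {
				return fmt.Errorf("job %q in group %q is also listed in %v", job, name, matches)
			}
		}
	}
	return nil
}

func main() {
	cfg := groups{
		"build01": {"job-a", "job-b"},
		"build02": {"job-b"}, // job-b is duplicated exactly once
	}
	fmt.Println(validate(cfg))
}
```

Calling such a check from the config loader, as the commit does with Validate() in LoadConfig(), means a duplicated job name is rejected at load time rather than only in tests.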
/retest

@jmguzik: all tests passed!
Dispatcher Reliability Fixes
This PR fixes multiple correctness and reliability issues in the Prow Job Dispatcher (the service that assigns Prow jobs to OpenShift CI build clusters), with practical effects for CI operators and users:
Unified capability parsing
Prevent dispatcher process exits on transient Prometheus failures
Stronger, production-run validation
Preserve existing assignments when no build-farm clusters are configured
Durable, atomic persistence of dispatch results
Overall impact: dispatcher restarts and transient monitoring failures are far less likely to cause lost assignments or process downtime; config issues are detected earlier; capability-based assignment is consistent and deterministic; on-disk assignment cache is durable and crash-resistant.
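The atomic persistence approach (temp file + rename) can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `writeGobAtomic`, `readGob`, the `map[string]string` payload, and the `assignments.gob` file name are all hypothetical. The temp file is created in the same directory as the target so that the final rename stays on one filesystem.

```go
package main

import (
	"encoding/gob"
	"fmt"
	"os"
	"path/filepath"
)

// writeGobAtomic encodes v to a temp file next to path, then renames it over
// path. A crash mid-write leaves the previous file untouched.
func writeGobAtomic(path string, v any) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	// On any failure, clean up the temp file so no partial data is left behind.
	defer func() {
		if err != nil {
			tmp.Close()
			os.Remove(tmp.Name())
		}
	}()
	if err = gob.NewEncoder(tmp).Encode(v); err != nil {
		return fmt.Errorf("encode: %w", err)
	}
	// Flush to disk before the rename makes the new contents visible.
	if err = tmp.Sync(); err != nil {
		return fmt.Errorf("sync: %w", err)
	}
	if err = tmp.Close(); err != nil {
		return fmt.Errorf("close: %w", err)
	}
	if err = os.Rename(tmp.Name(), path); err != nil {
		return fmt.Errorf("rename: %w", err)
	}
	return nil
}

// readGob decodes the gob file at path into v (v must be a pointer).
func readGob(path string, v any) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewDecoder(f).Decode(v)
}

func main() {
	path := filepath.Join(os.TempDir(), "assignments.gob") // hypothetical file name
	assignments := map[string]string{"job-a": "build01"}   // hypothetical payload
	if err := writeGobAtomic(path, assignments); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	var restored map[string]string
	if err := readGob(path, &restored); err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Println(restored["job-a"])
}
```

Readers of the file see either the old contents or the complete new contents, never a partially written file, which is the property the "crash-resistant" claim above relies on.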