Fix dispatcher reliability: capability parsing, crash-on-prom-failure, validation, and persistence by jmguzik · Pull Request #5196 · openshift/ci-tools

jmguzik · 2026-05-21T08:13:33Z

Unify capability extraction into a single exported ExtractCapabilities function in pkg/dispatcher.
Change logrus.Fatal to logrus.Error+return when GetJobVolumes fails. A transient Prometheus timeout no longer kills the entire dispatcher process.
Fix Validate() off-by-one: len(matches) > 1 should be > 0, so a single duplicated job name across groups is now caught. Also call Validate() from LoadConfig() so it runs in production, not only tests.
Skip Regenerate and downstream side-effects when dispatchJobs returns nil (no build farm clusters configured) instead of wiping all in-memory state. This prevents the HTTP API from either serving stale assignments to removed clusters or returning 404 for every job.
Persist delta dispatch results to the gob file so assignments survive process restarts.
Make gob writes atomic via temp file + rename to prevent corruption on crash mid-write.

Dispatcher Reliability Fixes

This PR fixes multiple correctness and reliability issues in the Prow Job Dispatcher (the service that assigns Prow jobs to OpenShift CI build clusters), with practical effects for CI operators and users:

Unified capability parsing
- Introduces a single exported ExtractCapabilities(labels map[string]string) []string in pkg/dispatcher. Capabilities are deterministically sorted and used everywhere the dispatcher matches cluster capabilities to job requirements, removing duplicated parsing logic and making capability-driven assignment consistent.
Prevent dispatcher process exits on transient Prometheus failures
- Replaces a logrus.Fatal call when GetJobVolumes fails with logging + early return. A transient Prometheus timeout or similar error will no longer terminate the dispatcher process.
Stronger, production-run validation
- Fixes an off-by-one in Validate() so a single duplicated job name across groups is now treated as an error (len(matches) > 0). LoadConfig() now invokes Validate(), so validation runs during normal production config loading (not only in tests), catching configuration mistakes earlier.
Preserve existing assignments when no build-farm clusters are configured
- If dispatchJobs returns nil (indicating no build-farm clusters configured), the dispatcher now skips regeneration and downstream side-effects instead of clearing in-memory state. This prevents the HTTP API from serving 404s or losing previously cached assignments when cluster configuration is temporarily missing.
Durable, atomic persistence of dispatch results
- Delta dispatch results are now persisted to the Gob cache so incremental assignment updates survive process restarts (assignments are no longer lost between full dispatch runs).
- Gob writes are performed atomically by writing to a temp file and renaming it into place, preventing cache corruption if the dispatcher crashes mid-write.

Overall impact: dispatcher restarts and transient monitoring failures are far less likely to cause lost assignments or process downtime; config issues are detected earlier; capability-based assignment is consistent and deterministic; on-disk assignment cache is durable and crash-resistant.

openshift-merge-bot · 2026-05-21T08:13:37Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

coderabbitai · 2026-05-21T08:13:48Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 0f086266-c5eb-4879-8930-064e81aea90f

📥 Commits

Reviewing files that changed from the base of the PR and between 64f1845 and 70af06a.

📒 Files selected for processing (3)

cmd/prow-job-dispatcher/main.go
pkg/dispatcher/config.go
pkg/dispatcher/gob.go

📝 Walkthrough

Walkthrough

Centralizes capability extraction to ExtractCapabilities, implements atomic gob writes with temp-file+rename, persists regenerated job assignments to Gob cache, and tightens config validation and error handling.

Changes

Dispatcher Refactoring

Layer / File(s)	Summary
Capability extraction refactor `pkg/dispatcher/config.go`, `cmd/prow-job-dispatcher/main.go`	Remove local `extractCapabilities`; add exported `ExtractCapabilities(labels map[string]string) []string`; replace inline capability extraction in presubmits/postsubmits/periodics, cluster-volume population, and DetermineClusterForJob.
Config validation adjustments `pkg/dispatcher/config.go`	Call `config.Validate()` from `LoadConfig()` after preprocessing; tighten duplicate-job-name check to error whenever duplicates exist.
Gob atomic write changes `pkg/dispatcher/gob.go`	Add `fmt` and `path/filepath` imports; simplify `ReadGob`; implement atomic `WriteGob` using a temp file in the target directory, encode, close, rename, and cleanup with contextual errors.
Dispatcher cache persistence & error handling `cmd/prow-job-dispatcher/main.go`	After `Regenerate(pjs)` call `dispatcher.WriteGob(o.jobsStoragePath, pjs)` (log-only errors); change job-volume fetch failure from Fatal to log-and-return; add guard to skip regeneration when `dispatchJobs` returns `nil` assignments.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

approved

Suggested reviewers

danilo-gemoli
smg247

🚥 Pre-merge checks | ✅ 11 | ❌ 3

❌ Failed checks (3 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Go Error Handling	⚠️ Warning	WriteGob in gob.go line 33 ignores tmp.Close() error without justification. While os.Remove() in cleanup paths is acceptable, tmp.Close() errors should be checked.	Check tmp.Close() error: change 'tmp.Close()' to 'if err := tmp.Close(); err != nil { os.Remove(tmpName); return fmt.Errorf(..., err) }' or log the error appropriately.
Test Coverage For New Features	⚠️ Warning	Critical new functions lack tests: ExtractCapabilities, atomic WriteGob, nil dispatchJobs handling, gob persistence, error handling. No regression test for validation fix.	Add tests for ExtractCapabilities, WriteGob atomicity/error handling, dispatchJobs nil return, gob persistence in dispatchDeltaWrapper, GetJobVolumes error handling, and single job name duplicate regression test.

✅ Passed checks (11 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly summarizes the main changes: capability parsing unification, removing fatal crash on Prometheus failures, validation fixes, and persistence improvements to the dispatcher.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	This PR modifies dispatcher code files but does not contain any Ginkgo tests. The project uses standard Go testing.T tests only, not Ginkgo. The check does not apply.
Test Structure And Quality	✅ Passed	PR adds standard Go tests (testing.T framework), not Ginkgo tests. Custom check is specifically for Ginkgo test code, making it inapplicable to this PR.
Microshift Test Compatibility	✅ Passed	No new Ginkgo e2e tests were added in this PR. Changes are limited to unit tests using Go's standard testing package and dispatcher code refactoring.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	This PR modifies only dispatcher logic files (main.go, config.go, gob.go) with no new Ginkgo e2e tests added. The check is not applicable.
Topology-Aware Scheduling Compatibility	✅ Passed	PR modifies only Go backend code (dispatcher utilities, gob serialization, config loading). No deployment manifests, operator code, controllers, or Kubernetes scheduling constraints are introduced.
Ote Binary Stdout Contract	✅ Passed	prow-job-dispatcher is not an OTE binary - it's a standard Prow dispatcher tool, not a test extension. The check does not apply to non-OTE binaries.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PR contains no Ginkgo e2e test additions. Files modified are dispatcher library and binary code, not e2e tests, so the IPv6/disconnected network check does not apply.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-05-21T08:20:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmguzik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [jmguzik]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…, validation, and persistence - Unify capability extraction into a single exported ExtractCapabilities function in pkg/dispatcher. - Change logrus.Fatal to logrus.Error+return when GetJobVolumes fails. A transient Prometheus timeout no longer kills the entire dispatcher process. - Fix Validate() off-by-one: len(matches) > 1 should be > 0, so a single duplicated job name across groups is now caught. Also call Validate() from LoadConfig() so it runs in production, not only tests. - Skip Regenerate and downstream side-effects when dispatchJobs returns nil (no build farm clusters configured) instead of wiping all in-memory state. This prevents the HTTP API from either serving stale assignments to removed clusters or returning 404 for every job. - Persist delta dispatch results to the gob file so assignments survive process restarts. - Make gob writes atomic via temp file + rename to prevent corruption on crash mid-write. Signed-off-by: Jakub Guzik <jguzik@redhat.com>

jmguzik · 2026-05-21T11:46:24Z

/retest

openshift-merge-bot · 2026-05-21T12:27:25Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e

openshift-ci · 2026-05-21T13:34:28Z

@jmguzik: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from bear-redhat and psalajova May 21, 2026 08:19

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2026

jmguzik force-pushed the dispatcher_fixes branch from 64f1845 to 70af06a Compare May 21, 2026 08:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix dispatcher reliability: capability parsing, crash-on-prom-failure, validation, and persistence#5196

Fix dispatcher reliability: capability parsing, crash-on-prom-failure, validation, and persistence#5196
jmguzik wants to merge 1 commit into
openshift:mainfrom
jmguzik:dispatcher_fixes

jmguzik commented May 21, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

openshift-merge-bot Bot commented May 21, 2026

Uh oh!

coderabbitai Bot commented May 21, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (3 warnings)

Uh oh!

openshift-ci Bot commented May 21, 2026

Uh oh!

jmguzik commented May 21, 2026

Uh oh!

openshift-merge-bot Bot commented May 21, 2026

Uh oh!

openshift-ci Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jmguzik commented May 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dispatcher Reliability Fixes

Uh oh!

openshift-merge-bot Bot commented May 21, 2026

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (3 warnings)

Uh oh!

openshift-ci Bot commented May 21, 2026

Uh oh!

jmguzik commented May 21, 2026

Uh oh!

openshift-merge-bot Bot commented May 21, 2026

Uh oh!

openshift-ci Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jmguzik commented May 21, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 21, 2026 •

edited

Loading