Skip to content

Fix dispatcher reliability: capability parsing, crash-on-prom-failure, validation, and persistence#5196

Open
jmguzik wants to merge 1 commit into
openshift:mainfrom
jmguzik:dispatcher_fixes
Open

Fix dispatcher reliability: capability parsing, crash-on-prom-failure, validation, and persistence#5196
jmguzik wants to merge 1 commit into
openshift:mainfrom
jmguzik:dispatcher_fixes

Conversation

@jmguzik
Copy link
Copy Markdown
Contributor

@jmguzik jmguzik commented May 21, 2026

  • Unify capability extraction into a single exported ExtractCapabilities function in pkg/dispatcher.
  • Change logrus.Fatal to logrus.Error+return when GetJobVolumes fails. A transient Prometheus timeout no longer kills the entire dispatcher process.
  • Fix Validate() off-by-one: len(matches) > 1 should be > 0, so a single duplicated job name across groups is now caught. Also call Validate() from LoadConfig() so it runs in production, not only tests.
  • Skip Regenerate and downstream side-effects when dispatchJobs returns nil (no build farm clusters configured) instead of wiping all in-memory state. This prevents the HTTP API from either serving stale assignments to removed clusters or returning 404 for every job.
  • Persist delta dispatch results to the gob file so assignments survive process restarts.
  • Make gob writes atomic via temp file + rename to prevent corruption on crash mid-write.

Dispatcher Reliability Fixes

This PR fixes multiple correctness and reliability issues in the Prow Job Dispatcher (the service that assigns Prow jobs to OpenShift CI build clusters), with practical effects for CI operators and users:

  • Unified capability parsing

    • Introduces a single exported ExtractCapabilities(labels map[string]string) []string in pkg/dispatcher. Capabilities are deterministically sorted and used everywhere the dispatcher matches cluster capabilities to job requirements, removing duplicated parsing logic and making capability-driven assignment consistent.
  • Prevent dispatcher process exits on transient Prometheus failures

    • Replaces a logrus.Fatal call when GetJobVolumes fails with logging + early return. A transient Prometheus timeout or similar error will no longer terminate the dispatcher process.
  • Stronger, production-run validation

    • Fixes an off-by-one in Validate() so a single duplicated job name across groups is now treated as an error (len(matches) > 0). LoadConfig() now invokes Validate(), so validation runs during normal production config loading (not only in tests), catching configuration mistakes earlier.
  • Preserve existing assignments when no build-farm clusters are configured

    • If dispatchJobs returns nil (indicating no build-farm clusters configured), the dispatcher now skips regeneration and downstream side-effects instead of clearing in-memory state. This prevents the HTTP API from serving 404s or losing previously cached assignments when cluster configuration is temporarily missing.
  • Durable, atomic persistence of dispatch results

    • Delta dispatch results are now persisted to the Gob cache so incremental assignment updates survive process restarts (assignments are no longer lost between full dispatch runs).
    • Gob writes are performed atomically by writing to a temp file and renaming it into place, preventing cache corruption if the dispatcher crashes mid-write.

Overall impact: dispatcher restarts and transient monitoring failures are far less likely to cause lost assignments or process downtime; config issues are detected earlier; capability-based assignment is consistent and deterministic; on-disk assignment cache is durable and crash-resistant.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 0f086266-c5eb-4879-8930-064e81aea90f

📥 Commits

Reviewing files that changed from the base of the PR and between 64f1845 and 70af06a.

📒 Files selected for processing (3)
  • cmd/prow-job-dispatcher/main.go
  • pkg/dispatcher/config.go
  • pkg/dispatcher/gob.go

📝 Walkthrough

Walkthrough

Centralizes capability extraction to ExtractCapabilities, implements atomic gob writes with temp-file+rename, persists regenerated job assignments to Gob cache, and tightens config validation and error handling.

Changes

Dispatcher Refactoring

Layer / File(s) Summary
Capability extraction refactor
pkg/dispatcher/config.go, cmd/prow-job-dispatcher/main.go
Remove local extractCapabilities; add exported ExtractCapabilities(labels map[string]string) []string; replace inline capability extraction in presubmits/postsubmits/periodics, cluster-volume population, and DetermineClusterForJob.
Config validation adjustments
pkg/dispatcher/config.go
Call config.Validate() from LoadConfig() after preprocessing; tighten duplicate-job-name check to error whenever duplicates exist.
Gob atomic write changes
pkg/dispatcher/gob.go
Add fmt and path/filepath imports; simplify ReadGob; implement atomic WriteGob using a temp file in the target directory, encode, close, rename, and cleanup with contextual errors.
Dispatcher cache persistence & error handling
cmd/prow-job-dispatcher/main.go
After Regenerate(pjs) call dispatcher.WriteGob(o.jobsStoragePath, pjs) (log-only errors); change job-volume fetch failure from Fatal to log-and-return; add guard to skip regeneration when dispatchJobs returns nil assignments.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

approved

Suggested reviewers

  • danilo-gemoli
  • smg247
🚥 Pre-merge checks | ✅ 11 | ❌ 3

❌ Failed checks (3 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Go Error Handling ⚠️ Warning WriteGob in gob.go line 33 ignores tmp.Close() error without justification. While os.Remove() in cleanup paths is acceptable, tmp.Close() errors should be checked. Check tmp.Close() error: change 'tmp.Close()' to 'if err := tmp.Close(); err != nil { os.Remove(tmpName); return fmt.Errorf(..., err) }' or log the error appropriately.
Test Coverage For New Features ⚠️ Warning Critical new functions lack tests: ExtractCapabilities, atomic WriteGob, nil dispatchJobs handling, gob persistence, error handling. No regression test for validation fix. Add tests for ExtractCapabilities, WriteGob atomicity/error handling, dispatchJobs nil return, gob persistence in dispatchDeltaWrapper, GetJobVolumes error handling, and single job name duplicate regression test.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly summarizes the main changes: capability parsing unification, removing fatal crash on Prometheus failures, validation fixes, and persistence improvements to the dispatcher.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed This PR modifies dispatcher code files but does not contain any Ginkgo tests. The project uses standard Go testing.T tests only, not Ginkgo. The check does not apply.
Test Structure And Quality ✅ Passed PR adds standard Go tests (testing.T framework), not Ginkgo tests. Custom check is specifically for Ginkgo test code, making it inapplicable to this PR.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests were added in this PR. Changes are limited to unit tests using Go's standard testing package and dispatcher code refactoring.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR modifies only dispatcher logic files (main.go, config.go, gob.go) with no new Ginkgo e2e tests added. The check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only Go backend code (dispatcher utilities, gob serialization, config loading). No deployment manifests, operator code, controllers, or Kubernetes scheduling constraints are introduced.
Ote Binary Stdout Contract ✅ Passed prow-job-dispatcher is not an OTE binary - it's a standard Prow dispatcher tool, not a test extension. The check does not apply to non-OTE binaries.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR contains no Ginkgo e2e test additions. Files modified are dispatcher library and binary code, not e2e tests, so the IPv6/disconnected network check does not apply.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from bear-redhat and psalajova May 21, 2026 08:19
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 21, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmguzik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2026
…, validation, and persistence

- Unify capability extraction into a single exported ExtractCapabilities
  function in pkg/dispatcher.
- Change logrus.Fatal to logrus.Error+return when GetJobVolumes fails.
  A transient Prometheus timeout no longer kills the entire dispatcher
  process.
- Fix Validate() off-by-one: len(matches) > 1 should be > 0, so a
  single duplicated job name across groups is now caught. Also call
  Validate() from LoadConfig() so it runs in production, not only tests.
- Skip Regenerate and downstream side-effects when dispatchJobs returns
  nil (no build farm clusters configured) instead of wiping all
  in-memory state. This prevents the HTTP API from either serving stale
  assignments to removed clusters or returning 404 for every job.
- Persist delta dispatch results to the gob file so assignments survive
  process restarts.
- Make gob writes atomic via temp file + rename to prevent corruption on
  crash mid-write.

Signed-off-by: Jakub Guzik <jguzik@redhat.com>
@jmguzik jmguzik force-pushed the dispatcher_fixes branch from 64f1845 to 70af06a Compare May 21, 2026 08:35
@jmguzik
Copy link
Copy Markdown
Contributor Author

jmguzik commented May 21, 2026

/retest

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 21, 2026

@jmguzik: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant