
[FLINK-39879][tests] Stop CheckpointAcknowledgeFailureITCase hanging on slow CI #28351

Open
MartijnVisser wants to merge 1 commit into apache:master from MartijnVisser:fix-checkpoint-ack-failure-hang

@MartijnVisser
Contributor

What is the purpose of the change

CheckpointAcknowledgeFailureITCase.testCheckpointAckFailure can hang the entire surefire fork on a slow CI machine until the 900s no-output watchdog kills it (observed on master, build 75716). The test sets a deliberately tiny pekko.ask.timeout so the oversized checkpoint-ACK RPC times out (load-bearing for the AskTimeoutException assertion), but that timeout applies to every cluster RPC. On a slow agent an unrelated RPC (e.g. task deployment) can time out and fail the job terminally before the keyed state is updated, so the future the test waits on never completes and the unbounded wait wedges the fork.

This makes the wait fail fast with the real cause and adds a hard timeout so the test can never hang a fork again. It is a test-only stabilization; the assertion is unchanged.

Brief change log

  • Propagate a terminal job failure into stateUpdatedFuture via a whenComplete handler on getJobExecutionResult() (unwrapping the CompletionException to the real cause), so the wait fails fast instead of blocking forever.
  • executeJobAsync now returns the JobClient so the test can observe job termination.
  • Add @Timeout(5, MINUTES) as the hard anti-hang guard (consistent with ApproximateLocalRecoveryDownstreamITCase in the same package).
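
The propagation described in the first bullet can be sketched with plain java.util.concurrent types. This is a minimal illustration, not the code from the PR: the class name FailFastWaitSketch, the helper names unwrap and propagateFailure, and the String result type are all assumptions made for the example, and the two futures merely stand in for the result of getJobExecutionResult() and for stateUpdatedFuture.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone sketch; no Flink classes are involved.
class FailFastWaitSketch {

    /** Returns the cause of a CompletionException wrapper, or the throwable itself otherwise. */
    static Throwable unwrap(Throwable t) {
        return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
    }

    /**
     * Forwards a failure of jobResult into stateUpdated so that anyone waiting on
     * stateUpdated is released with the real cause. A normal completion of
     * jobResult is deliberately ignored (an assumption of this sketch).
     */
    static <T> void propagateFailure(
            CompletableFuture<T> jobResult, CompletableFuture<Void> stateUpdated) {
        jobResult.whenComplete(
                (result, failure) -> {
                    if (failure != null) {
                        // No effect if stateUpdated has already completed.
                        stateUpdated.completeExceptionally(unwrap(failure));
                    }
                });
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> jobResult = new CompletableFuture<>();
        CompletableFuture<Void> stateUpdated = new CompletableFuture<>();
        propagateFailure(jobResult, stateUpdated);

        // Simulate the job failing terminally before the state update happens.
        jobResult.completeExceptionally(
                new CompletionException(new IllegalStateException("job failed")));

        try {
            // The bounded get is only a safety net here; the future is already failed.
            stateUpdated.get(5, TimeUnit.SECONDS);
        } catch (ExecutionException e) {
            System.out.println("wait failed fast with: " + e.getCause());
        }
    }
}
```

Without the handler, stateUpdated would stay incomplete after the job failure and an unbounded wait on it would never return, which is the hang this PR removes.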

Verifying this change

This change is a test stabilization of an existing test; the assertion (triggerCheckpoint(...) must throw CheckpointException caused by AskTimeoutException) is unchanged. Verified locally:

  • Happy path passes: mvn -pl flink-tests -Dtest=CheckpointAcknowledgeFailureITCase test reports "Tests run: 1, Failures: 0".
  • Fail-fast path: with the ask timeout temporarily forced very low, so that an unrelated RPC fails the job before the state update, the test now fails fast (~24s) with the job's real JobExecutionException instead of hanging until the watchdog kill.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only change)
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?
  • Yes (Claude Opus 4.8 (1M context))

Generated-by: Claude Opus 4.8 (1M context)

…on slow CI

The test waited on an unbounded future that never completes when the tiny
pekko.ask.timeout (load-bearing for the AskTimeoutException assertion) fails
the job before the keyed state is updated, hanging the surefire fork until
the CI watchdog kills it. Propagate a terminal job failure into the wait so
the test fails fast with the real cause, and add @Timeout(5, MINUTES) as the
hard anti-hang guard. The product-side follow-up is tracked as FLINK-39738.

Generated-by: Claude Opus 4.8 (1M context)
@flinkbot
Collaborator

flinkbot commented Jun 7, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

