[FLINK-39879][tests] Stop CheckpointAcknowledgeFailureITCase hanging on slow CI by MartijnVisser · Pull Request #28351 · apache/flink

MartijnVisser · 2026-06-07T14:50:51Z

What is the purpose of the change

CheckpointAcknowledgeFailureITCase.testCheckpointAckFailure can hang the entire surefire fork on a slow CI machine until the 900s no-output watchdog kills it (observed on master, build 75716). The test sets a deliberately tiny pekko.ask.timeout so the oversized checkpoint-ACK RPC times out (load-bearing for the AskTimeoutException assertion), but that timeout applies to every cluster RPC. On a slow agent an unrelated RPC (e.g. task deployment) can time out and fail the job terminally before the keyed state is updated, so the future the test waits on never completes and the unbounded wait wedges the fork.

This makes the wait fail fast with the real cause and adds a hard timeout so the test can never hang a fork again. It is a test-only stabilization; the assertion is unchanged.

Brief change log

Propagate a terminal job failure into stateUpdatedFuture via a whenComplete handler on getJobExecutionResult() (unwrapping the CompletionException to the real cause), so the wait fails fast instead of blocking forever.
executeJobAsync now returns the JobClient so the test can observe job termination.
Add @Timeout(5, MINUTES) as the hard anti-hang guard (consistent with ApproximateLocalRecoveryDownstreamITCase in the same package).
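
The first item above can be sketched with plain JDK futures. This is only an illustration under stated assumptions, not the code in the PR: a bare `CompletableFuture` stands in for the future returned by `getJobExecutionResult()`, and the class and method names (`FailFastWaitSketch`, `propagateFailure`, `unwrap`) are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

public class FailFastWaitSketch {

    /** Unwraps a CompletionException to its cause, if it has one. */
    static Throwable unwrap(Throwable t) {
        return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
    }

    /**
     * Links the job-result future to the future the test waits on: if the job
     * terminates exceptionally, the waited-on future fails with the real cause
     * instead of staying incomplete forever.
     */
    static <T> void propagateFailure(
            CompletableFuture<T> jobResult, CompletableFuture<Void> stateUpdated) {
        jobResult.whenComplete(
                (result, error) -> {
                    if (error != null) {
                        // No-op if stateUpdated has already completed normally.
                        stateUpdated.completeExceptionally(unwrap(error));
                    }
                });
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the job result future (assumption: not the real Flink type).
        CompletableFuture<String> jobResult = new CompletableFuture<>();
        CompletableFuture<Void> stateUpdated = new CompletableFuture<>();
        propagateFailure(jobResult, stateUpdated);

        // Simulate the job failing terminally before the state update happens.
        jobResult.completeExceptionally(
                new CompletionException(new IllegalStateException("unrelated RPC timed out")));

        try {
            // The bounded get is only a safety net for this demo; the wait ends at once.
            stateUpdated.get(5, TimeUnit.SECONDS);
        } catch (ExecutionException e) {
            System.out.println("failed fast with: " + e.getCause());
        }
    }
}
```

Without the `whenComplete` link, `stateUpdated` would never complete in this scenario, which corresponds to the hang described above. The `@Timeout` annotation from the last item is a separate, test-level guard and is not modelled here.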

Verifying this change

This change is a test stabilization of an existing test; the assertion (triggerCheckpoint(...) must throw CheckpointException caused by AskTimeoutException) is unchanged. Verified locally:

Happy path passes: mvn -pl flink-tests -Dtest=CheckpointAcknowledgeFailureITCase test → Tests run: 1, Failures: 0.
Fail-fast path: with the ask timeout temporarily forced very low, so that an unrelated RPC fails the job before the state update, the test now fails fast (~24s) with the job's real JobExecutionException instead of hanging until the watchdog kills the fork.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only change)
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?

Yes (Claude Opus 4.8 (1M context))

Generated-by: Claude Opus 4.8 (1M context)

…on slow CI

The test waited on an unbounded future that never completes when the tiny pekko.ask.timeout (load-bearing for the AskTimeoutException assertion) fails the job before the keyed state is updated, hanging the surefire fork until the CI watchdog kills it. Propagate a terminal job failure into the wait so the test fails fast with the real cause, and add @Timeout(5, MINUTES) as the hard anti-hang guard. The product-side follow-up is tracked as FLINK-39738.

Generated-by: Claude Opus 4.8 (1M context)

flinkbot · 2026-06-07T14:56:07Z

CI report:

4efce2b Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

MartijnVisser requested a review from rkhachatryan June 7, 2026 14:52

MartijnVisser wants to merge 1 commit into apache:master from MartijnVisser:fix-checkpoint-ack-failure-hang
