[FLINK-38536][tests] Fix data race in FinalizeOnMasterTest and ExecutionGraphFinishTest #28350
Open
MartijnVisser wants to merge 2 commits into
forMainThread() runs the async deployment callbacks inline on the I/O thread, racing with the test thread that schedules the execution graph. Run all execution graph interactions on a dedicated single-threaded main-thread executor instead, and drop the temporary debug logging from PR apache#27168. Generated-by: Claude Opus 4.8 (1M context)
testJobFinishes() shares the forMainThread() wiring and the same deployment-callback race fixed in FinalizeOnMasterTest. Apply the same fix. Generated-by: Claude Opus 4.8 (1M context)
What is the purpose of the change
This pull request fixes an intermittent CI failure in FinalizeOnMasterTest (reported in FLINK-38536, e.g. build 75697) and the same latent data race in its sibling ExecutionGraphFinishTest.
The flakiness is a test-harness defect, not a production bug. Both tests wired the scheduler with ComponentMainThreadExecutorServiceAdapter.forMainThread() as the JobManager main-thread executor while passing a separate single-threaded I/O executor. forMainThread() is backed by a DirectScheduledExecutorService, whose execute() runs the submitted command inline on the calling thread rather than confining work to one dedicated main thread.
Execution#deploy() builds the TaskDeploymentDescriptor on the I/O executor via CompletableFuture.supplyAsync(..., ioExecutor) and then composes the continuation back to the main thread with thenComposeAsync(..., mainThreadExecutor). Because forMainThread() executes inline, those continuations (TDD creation, task submission, and the deployment-completion handling that can call markFailed) ran on the I/O thread, concurrently with the test thread that was still inside startScheduling() (and, in ExecutionGraphFinishTest, subsequently mutating state via markFinished()). This unsynchronized concurrent mutation of the execution graph produced the two observed signatures:
IllegalStateException: BUG: trying to schedule a region which is not in CREATED state (a region's vertices were mutated mid-scheduling), and
expected: RUNNING but was: FAILING (a background deployment callback failed an execution and triggered failover).
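
The thread hop described above can be reproduced with plain JDK classes. The following is a minimal sketch, not code from this PR or from Flink: the class name, the thread names, and the `Runnable::run` stand-in for a direct (inline) executor are all assumptions made for illustration. It shows that a continuation attached with thenComposeAsync runs on whichever thread completes the upstream future when the target executor executes inline, and on the dedicated thread when the target is a real single-threaded executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** JDK-only illustration; none of these names come from Flink or from this PR. */
public class InlineExecutorRaceSketch {

    /** Returns the name of the thread that ran the "main thread" continuation. */
    static String continuationThread(Executor ioExecutor, Executor mainThreadExecutor) {
        CountDownLatch continuationAttached = new CountDownLatch(1);
        CompletableFuture<String> ranOn =
                CompletableFuture.supplyAsync(
                                () -> {
                                    // Hold the I/O stage open until the continuation is
                                    // registered, so the I/O thread completes the future.
                                    awaitLatch(continuationAttached);
                                    return "descriptor";
                                },
                                ioExecutor)
                        .thenComposeAsync(
                                descriptor ->
                                        CompletableFuture.completedFuture(
                                                Thread.currentThread().getName()),
                                mainThreadExecutor);
        continuationAttached.countDown();
        return ranOn.join();
    }

    private static void awaitLatch(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ExecutorService io =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "io-thread"));
        ExecutorService main =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "jm-main-thread"));
        try {
            // Assumption: an executor that runs commands on the calling thread,
            // standing in for the inline behavior described above.
            Executor inline = Runnable::run;
            System.out.println(continuationThread(io, inline)); // prints "io-thread"
            System.out.println(continuationThread(io, main)); // prints "jm-main-thread"
        } finally {
            io.shutdownNow();
            main.shutdownNow();
        }
    }
}
```

In the inline case the continuation shares the I/O thread, so anything it mutates is unsynchronized with respect to the thread that started the work. In the dedicated case every continuation is queued behind whatever the main thread is currently doing.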
The async I/O-executor deploy path was introduced recently by FLINK-38114 (asynchronous offloading of TaskRestore), which is why these tests started flaking now. The production scheduling code is unchanged; the fix is confined to the test harness.
Brief change log
FinalizeOnMasterTest: replace forMainThread() with a dedicated single-threaded JobManager main-thread executor (forSingleThreadExecutor over the existing TestingUtils.jmMainThreadExecutorExtension()), keeping the separate I/O executor. All execution-graph interactions are routed through runInMainThread/supplyInMainThread helpers so the asynchronous deployment callbacks are serialized with the test logic instead of racing on the I/O thread. The temporary debug logging added under FLINK-38536 by PR #27168 ([FLINK-38536][tests] Add debugging for test failure) is removed, as the root cause is now fixed.
ExecutionGraphFinishTest: apply the identical remedy to testJobFinishes(), which shares the same wiring and code path and is therefore exposed to the same race.
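
The diff itself is not reproduced here, so the exact shape of the runInMainThread/supplyInMainThread helpers is not shown. As a rough, JDK-only sketch of the idea, assuming the helpers simply submit to a single-threaded executor and block until the action has run (the class name, the `shutdown` method, and the list standing in for execution-graph state are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Hypothetical stand-in for the test helpers; not the PR's implementation. */
public class MainThreadHelpersSketch {

    // One dedicated thread plays the role of the JobManager main thread.
    private final ExecutorService mainThreadExecutor =
            Executors.newSingleThreadExecutor(r -> new Thread(r, "jm-main-thread"));

    /** Runs the action on the dedicated thread and waits for it to finish. */
    void runInMainThread(Runnable action) {
        CompletableFuture.runAsync(action, mainThreadExecutor).join();
    }

    /** Computes a value on the dedicated thread and hands it back to the caller. */
    <T> T supplyInMainThread(Supplier<T> supplier) {
        return CompletableFuture.supplyAsync(supplier, mainThreadExecutor).join();
    }

    void shutdown() {
        mainThreadExecutor.shutdownNow();
    }

    public static void main(String[] args) {
        MainThreadHelpersSketch harness = new MainThreadHelpersSketch();
        // Non-thread-safe state, only ever touched from the dedicated thread.
        List<String> graphState = new ArrayList<>();
        try {
            harness.runInMainThread(() -> graphState.add("scheduled"));
            harness.runInMainThread(() -> graphState.add("finished"));
            List<String> snapshot =
                    harness.supplyInMainThread(() -> new ArrayList<>(graphState));
            System.out.println(snapshot); // prints "[scheduled, finished]"
            System.out.println(
                    harness.supplyInMainThread(() -> Thread.currentThread().getName()));
            // prints "jm-main-thread"
        } finally {
            harness.shutdown();
        }
    }
}
```

Two caveats apply to this sketch: `join()` rethrows failures wrapped in a CompletionException, and calling either helper from the dedicated thread itself would block forever, so callbacks already running there should touch the state directly.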
No assertion or functional test logic was changed in either test; this is purely a threading-model fix in the test harness. The change mirrors the pattern already used by ExecutionGraphSuspendTest and SchedulerTestingUtils.
Verifying this change
This change is already covered by the existing tests, which it stabilizes:
mvn test -pl flink-runtime -Dtest='FinalizeOnMasterTest,ExecutionGraphFinishTest' passes (3 tests, 0 failures), and
spotless:check passes.
reproducible locally in isolation, which is consistent with the diagnosed race; the
fix removes the cross-thread access entirely rather than widening a timing window.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no
Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only change)
Documentation
AI-assisted contributions
Generated-by: Claude Opus 4.8 (1M context)