Skip to content

[FLINK-38536][tests] Fix data race in FinalizeOnMasterTest and ExecutionGraphFinishTest#28350

Open
MartijnVisser wants to merge 2 commits into
apache:masterfrom
MartijnVisser:FLINK-38536
Open

[FLINK-38536][tests] Fix data race in FinalizeOnMasterTest and ExecutionGraphFinishTest#28350
MartijnVisser wants to merge 2 commits into
apache:masterfrom
MartijnVisser:FLINK-38536

Conversation

@MartijnVisser
Copy link
Copy Markdown
Contributor

What is the purpose of the change

This pull request fixes an intermittent CI failure in FinalizeOnMasterTest
(reported in FLINK-38536,
e.g. build 75697)
and the same latent data race in its sibling ExecutionGraphFinishTest.

The flakiness is a test-harness defect, not a production bug. Both tests wired the
scheduler with ComponentMainThreadExecutorServiceAdapter.forMainThread() as the
JobManager main-thread executor while passing a separate single-threaded I/O
executor. forMainThread() is backed by a DirectScheduledExecutorService, whose
execute() runs the submitted command inline on the calling thread rather than
confining work to one dedicated main thread.

Execution#deploy() builds the TaskDeploymentDescriptor on the I/O executor via
CompletableFuture.supplyAsync(..., ioExecutor) and then composes the continuation
back to the main thread with thenComposeAsync(..., mainThreadExecutor). Because
forMainThread() executes inline, those continuations — TDD creation, task
submission, and the deployment-completion handling that can call markFailed — ran
on the I/O thread, concurrently with the test thread that was still inside
startScheduling() (and, in ExecutionGraphFinishTest, subsequently mutating state
via markFinished()). This unsynchronized concurrent mutation of the execution graph
produced the two observed signatures:

  • IllegalStateException: BUG: trying to schedule a region which is not in CREATED state
    (a region's vertices were mutated mid-scheduling), and
  • expected: RUNNING but was: FAILING (a background deployment callback failed an
    execution and triggered failover).

The async I/O-executor deploy path was introduced recently by
FLINK-38114 (asynchronous
offloading of TaskRestore), which is why these tests started flaking now. The
production scheduling code is unchanged; the fix is confined to the test harness.

Brief change log

  • FinalizeOnMasterTest: replace forMainThread() with a dedicated single-threaded
    JobManager main-thread executor (forSingleThreadExecutor over the existing
    TestingUtils.jmMainThreadExecutorExtension()), keeping the separate I/O executor.
    All execution-graph interactions are routed through runInMainThread /
    supplyInMainThread helpers so the asynchronous deployment callbacks are serialized
    with the test logic instead of racing on the I/O thread. The temporary debug logging
    added under FLINK-38536 by PR [FLINK-38536][tests] Add debugging for test failure #27168 is removed, as the root cause is now fixed.
  • ExecutionGraphFinishTest: apply the identical remedy to testJobFinishes(), which
    shares the same wiring and code path and is therefore exposed to the same race.

No assertion or functional test logic was changed in either test; this is purely a
threading-model fix in the test harness. The change mirrors the pattern already used
by ExecutionGraphSuspendTest and SchedulerTestingUtils.

Verifying this change

This change is already covered by the existing tests, which it stabilizes:

  • mvn test -pl flink-runtime -Dtest='FinalizeOnMasterTest,ExecutionGraphFinishTest'
    passes (3 tests, 0 failures), and spotless:check passes.
  • The original flake requires the concurrent I/O-thread timing of CI and is not
    reproducible locally in isolation, which is consistent with the diagnosed race; the
    fix removes the cross-thread access entirely rather than widening a timing window.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components),
    Checkpointing, Kubernetes/Yarn, ZooKeeper: no (test-only change)
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

AI-assisted contributions

  • Yes — generative AI tooling was used (Claude Code, Opus 4.8).

Generated-by: Claude Opus 4.8 (1M context)

forMainThread() runs the async deployment callbacks inline on the I/O thread, racing
with the test thread that schedules the execution graph. Run all execution graph
interactions on a dedicated single-threaded main-thread executor instead, and drop
the temporary debug logging from PR apache#27168.

Generated-by: Claude Opus 4.8 (1M context)
testJobFinishes() shares the forMainThread() wiring and the same deployment-callback
race fixed in FinalizeOnMasterTest. Apply the same fix.

Generated-by: Claude Opus 4.8 (1M context)
@flinkbot
Copy link
Copy Markdown
Collaborator

flinkbot commented Jun 7, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants