[FLINK-36753][runtime] Adaptive Scheduler actively triggers a Checkpoint after all resources are ready #27921
Samrat002 wants to merge 1 commit into
@1996fanrui PTAL whenever you have time.
pnowojski left a comment
Thanks for the contribution. I've left a couple of comments; however, I don't have the context to review whether this is properly integrated with AdaptiveScheduler and DefaultStateTransitionManager. It would be great for someone else to take a look as well.
    }

    @Test
    void testRescaleWithActiveCheckpointTrigger(
Have you made sure that this test is failing without your change?
Also, I don't see this test enabling SCHEDULER_RESCALE_TRIGGER_ACTIVE_CHECKPOINT_ENABLED anywhere? How is it passing right now?
I have updated the tests with the right details. PTAL
@Samrat002, just to double-check: is the test testRescaleWithActiveCheckpointTrigger failing without your fix?
@flinkbot run azure
@ztison @pnowojski PTAL. I have addressed the review comments, added unit tests, made the IT more robust, and ensured min-pause is respected.
Thanks for incorporating our improvements. I was on vacation the last few days, so I haven't responded. I am back and will check the PR today or tomorrow.
ztison left a comment
I see some issues with retry logic.
@ztison PTAL, I have addressed the latest review comments.
37050e6 to e9735a9
@pnowojski @XComp PTAL whenever you have time.
XComp left a comment
Thanks for working on this. It's a great feature. I have a few comments. PTAL
b3661ff to 1f16c96
@XComp PTAL. I have addressed the review comments and tested it exclusively in the cluster.
XComp left a comment
Thanks for addressing my comments. I did another pass over the change and added a few comments. PTAL :-)
b7ed160 to 6aebacc
XComp left a comment
Sorry for the delay. I managed to look over the PR once more. Just a few minor comments. PTAL
Please squash the commits and prepare the PR for merging. Good job! 👍
@flinkbot run azure intermediate maven pull failure
XComp left a comment
Thanks for addressing my comments. Looks good from my side. 👍
@pnowojski anything to add from your side? I will merge the PR next week if nothing else is raised.
pnowojski left a comment
I have only one question: testRescaleWithActiveCheckpointTrigger
If the test is failing in the expected way without your fix, LGTM.
…nt after all resources are ready
Hi @pnowojski, thank you for asking me to validate the changes for the negative case. I have created a draft change to showcase it: #28347. Failing CI run for reference: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75747&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=1ffc5ec2-7913-50ff-0177-3fca16f1b8f0&l=10187 I validated that the test fails without the fix. LMK if you see any gaps.
What is the purpose of the change
FLIP-461 introduced checkpoint-synchronized rescaling where the Adaptive Scheduler waits for a checkpoint to complete before rescaling. However, it passively waits for the next periodic checkpoint, which can delay rescaling significantly when checkpoint intervals are large (e.g., 10 minutes).
This PR makes the Adaptive Scheduler actively trigger a checkpoint when resources change and rescaling is desired. The trigger fires when the DefaultStateTransitionManager enters the Stabilizing or Stabilized phase, i.e., when the resource gate is open and the scheduler is waiting for the checkpoint gate. The feature is controlled by a new configuration option jobmanager.adaptive-scheduler.rescale-trigger.active-checkpoint.enabled (default: false).
The feature respects execution.checkpointing.min-pause, skips if a checkpoint is already in progress, and only fires when parallelism has actually changed.
Brief change log
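As a rough illustration of the guards described in the purpose section (feature flag, parallelism change, in-flight checkpoint, min-pause), here is a minimal sketch. The class ActiveCheckpointTriggerGuard, the method shouldTrigger, and all parameter names are hypothetical and are not taken from this PR's code; the order of the checks is also an assumption.

```java
import java.time.Duration;

/** Hypothetical sketch of the guard checks; not the PR's actual implementation. */
final class ActiveCheckpointTriggerGuard {
    // Mirrors jobmanager.adaptive-scheduler.rescale-trigger.active-checkpoint.enabled
    private final boolean featureEnabled;
    // Mirrors execution.checkpointing.min-pause
    private final Duration minPause;

    ActiveCheckpointTriggerGuard(boolean featureEnabled, Duration minPause) {
        this.featureEnabled = featureEnabled;
        this.minPause = minPause;
    }

    /** Returns true only if every guard allows an active checkpoint trigger. */
    boolean shouldTrigger(
            int currentParallelism,
            int desiredParallelism,
            boolean checkpointInProgress,
            Duration sinceLastCheckpoint) {
        if (!featureEnabled) {
            return false; // feature is off by default
        }
        if (currentParallelism == desiredParallelism) {
            return false; // parallelism has not actually changed
        }
        if (checkpointInProgress) {
            return false; // an in-flight checkpoint will open the gate anyway
        }
        // Respect the configured minimum pause between checkpoints.
        return sinceLastCheckpoint.compareTo(minPause) >= 0;
    }
}
```

With a guard like this, an initial deploy at unchanged parallelism returns false, while a 2 → 4 change with no in-flight checkpoint and an elapsed min-pause returns true.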
Verifying this change
End-to-end test on a real cluster
Verified the feature on a local 2-TaskManager standalone cluster running the LargeStateGeneratorJob benchmark with a deliberately long checkpoint interval, so any checkpoint firing within seconds must be the active trigger.
Setup
Scheduler: Adaptive
jobmanager.adaptive-scheduler.rescale-trigger.active-checkpoint.enabled: true
execution.checkpointing.interval: 1 h
execution.checkpointing.min-pause: 0 s
state.backend.type: hashmap
Job: LargeStateGeneratorJob at parallelism 2, ~16 MB keyed state
Scenarios
Initial deploy at parallelism 2: Executing.requestActiveCheckpointTrigger() was called from Stabilizing entry, but the parallelismChanged() guard correctly returned false; no active trigger fired.
Scale up 2 → 4 (start 2nd TM, then PUT /jobs/<id>/resource-requirements with upperBound: 4): active trigger fired immediately, checkpoint completed in 22 ms, rescale proceeded.
Scale down 4 → 2 (PUT … upperBound: 2): same flow, 22 ms checkpoint, rescale proceeded.
Grepped Log Lines
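For reference, the rescale request used in the scenarios (PUT /jobs/<id>/resource-requirements with upperBound: 4) could be built from Java roughly as follows. This is a sketch under assumptions: the REST base URL, job ID, and vertex ID are placeholders, the helper class ResourceRequirementsRequest is hypothetical, and the body shape (a map from job vertex ID to parallelism bounds) should be checked against the REST API documentation for your Flink version.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that builds, but does not send, the rescale request. */
final class ResourceRequirementsRequest {

    /** Assumed body shape: { "<vertexId>": { "parallelism": { lowerBound, upperBound } } }. */
    static String body(String vertexId, int lowerBound, int upperBound) {
        return "{\"" + vertexId + "\":{\"parallelism\":{\"lowerBound\":" + lowerBound
                + ",\"upperBound\":" + upperBound + "}}}";
    }

    /** Builds a PUT request against <restBase>/jobs/<jobId>/resource-requirements. */
    static HttpRequest build(
            String restBase, String jobId, String vertexId, int lowerBound, int upperBound) {
        return HttpRequest.newBuilder()
                .uri(URI.create(restBase + "/jobs/" + jobId + "/resource-requirements"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body(vertexId, lowerBound, upperBound)))
                .build();
    }
}
```

Sending the request would go through java.net.http.HttpClient; it is left out here so the sketch stays independent of a running cluster.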
Entire JobManager logs
e2e-proof-jm.log
Does this pull request potentially affect one of the following parts:
Documentation