[CELEBORN-2307] Support accurate disk usage accounting to HARD_SPLIT accurately. #3644
saurabhd336 wants to merge 15 commits into apache:main from
Conversation
Please create a JIRA to tag this PR @saurabhd336
@zaynt4606 I've created and attached the JIRA. Could you please help review / assign reviewers? Just for context, we have seen issues where the async nature of the disk usage update can delay HARD_SPLITs to the point where we completely run out of disk space. For our setup, this can lead to rather serious degradation. This config-based feature will help us be more accurate with our disk usage accounting. cc: @s0nskar
#3653 has already been merged, which fixes the core issue when the disk is full (the root cause is described in #3653). Given that #3653 already solves the core disk-full issue with a simpler fix, I'm not sure the complexity introduced by this PR (thread-safety concerns, …) is worth it.
@RexXiong I agree #3653 helps with the disk limit breach problems. And yes, there are overheads associated with this accurate disk usage accounting (CAS contention across eventLoop / flush / sorter threads). For that reason I've put all of this behind a feature flag, disabled by default. We plan to enable this only for some very disk-usage-sensitive (tier-1) clusters. But happy to take feedback from others if they see value in having this accurate accounting.
What changes were proposed in this pull request?
Oftentimes, Celeborn is too late in detecting disk-full issues simply because the DiskInfo's `usableSpace` is updated asynchronously in the worker heartbeat flow. In such cases, if heartbeats are missed and/or multiple very large writers end up pushing too much data to memory buffers (bypassing the disk-full based HARD_SPLIT checks), it can cause severe degradation.
In some cases we've noticed that we easily breach the configured disk usage limit, causing job degradation and cleanup failures (due to RocksDB sharing the disk with shuffle data), which makes the situation even worse.
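To illustrate the staleness problem described above, here is a minimal toy sketch (class and method names are hypothetical, not the actual Celeborn code): writers consult a usable-space snapshot that is only refreshed on each heartbeat, so every byte flushed between heartbeats is invisible to the HARD_SPLIT check.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of asynchronous disk accounting.
// Writers check a snapshot refreshed only on heartbeat, so the
// HARD_SPLIT decision can be based on badly stale data.
public class StaleUsageSnapshot {
    private volatile long usableSpaceSnapshot;       // refreshed each heartbeat
    private final AtomicLong writtenSinceHeartbeat = new AtomicLong(0);

    public StaleUsageSnapshot(long initialUsable) {
        this.usableSpaceSnapshot = initialUsable;
    }

    /** Writers consult only the snapshot, not bytes written since the last heartbeat. */
    public boolean shouldHardSplit(long minReserve) {
        return usableSpaceSnapshot < minReserve;
    }

    /** A flush lands on disk, but the snapshot is unchanged until the next heartbeat. */
    public void write(long bytes) {
        writtenSinceHeartbeat.addAndGet(bytes);
    }

    /** Heartbeat thread reconciles the snapshot, possibly many seconds late. */
    public void onHeartbeat() {
        usableSpaceSnapshot -= writtenSinceHeartbeat.getAndSet(0);
    }
}
```

Between heartbeats, `shouldHardSplit` keeps returning `false` even though the disk may already be past the limit, which is exactly the window this PR closes.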
This change proposes a more real-time, coordinated space acquisition during flush, making the disk-full detection foolproof and preventing any spillage beyond the configured limits.
Additionally, the extra disk space used while sorting partition files currently isn't accounted for at all. This PR also changes the logic to account for disk space used / reclaimed during the file sorting process.
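The coordinated acquisition could be sketched roughly as follows (a simplified, hypothetical sketch, not the actual Celeborn implementation): an `AtomicLong` tracks reserved bytes, flush and sorter threads CAS-reserve before writing, and a failed reservation signals HARD_SPLIT. This is also the source of the CAS contention mentioned in the review discussion.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of strict disk reservation. Flush/sorter threads
// reserve space via CAS before writing and release it when space is
// reclaimed, so usage can never silently exceed the configured limit.
public class StrictDiskReserver {
    private final long capacityBytes;                 // configured usable limit
    private final AtomicLong reservedBytes = new AtomicLong(0);

    public StrictDiskReserver(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Try to reserve bytes before a flush; false would signal HARD_SPLIT. */
    public boolean tryReserve(long bytes) {
        while (true) {
            long cur = reservedBytes.get();
            if (cur + bytes > capacityBytes) {
                return false;                          // would exceed the limit
            }
            if (reservedBytes.compareAndSet(cur, cur + bytes)) {
                return true;
            }
            // CAS lost to a concurrent flush/sorter thread; retry.
        }
    }

    /** Release reclaimed space, e.g. after sorting deletes the original file. */
    public void release(long bytes) {
        reservedBytes.addAndGet(-bytes);
    }

    public long reserved() {
        return reservedBytes.get();
    }
}
```

The retry loop is where contention shows up under many concurrent flushers, which is why the feature is gated behind a config and off by default.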
Everything is behind a new config, `celeborn.worker.disk.storage.strictReserve.enabled`, which currently defaults to `false`.
Why are the changes needed?
Disk-full detection is not foolproof.
Does this PR resolve a correctness bug?
No
Does this PR introduce any user-facing change?
No
How was this patch tested?
UTs added