[CELEBORN-2307] Support accurate disk usage accounting to HARD_SPLIT accurately. #3644
saurabhd336 wants to merge 15 commits into apache:main from
Conversation
Please create a JIRA to tag this PR @saurabhd336
@zaynt4606 I've created and attached the JIRA. Could you please help review / assign reviewers? Just for context, we have seen issues where the async nature of the disk usage update can delay HARD_SPLITs to the point where we completely run out of disk space. For our setup, this can lead to rather serious degradation. This config-based feature will help us be more accurate with our disk usage accounting. cc: @s0nskar
#3653 has already been merged, which fixes the core issue when the disk is full (the root cause is described in #3653). Given that #3653 already solves the core disk-full issue with a simpler fix, I'm not sure the complexity introduced by this PR (thread-safety concerns, …) is worth it.
@RexXiong I agree #3653 helps with the disk limit breach problems. And yes, there are overheads associated with this accurate disk usage accounting (CAS contention across eventLoop / flush / sorter threads). For that reason I've put all of this behind a feature flag, disabled by default. We plan to enable this only for some very disk-usage-sensitive (tier-1) clusters. But happy to take feedback from others if they see value in having this accurate accounting.
What changes were proposed in this pull request?
Oftentimes, Celeborn is too late in detecting disk-full issues simply because the DiskInfo's `usableSpace` is updated asynchronously in the worker heartbeat flow. In such cases, if heartbeats are missed and/or multiple very large writers end up pushing too much data to memory buffers (bypassing the disk-full based HARD_SPLIT checks), it can cause severe degradation.
In some cases we've noticed that we easily breach the configured disk usage limit, causing job degradation and cleanup failures (due to RocksDB sharing the disk with shuffle data), which makes the situation even worse.
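To illustrate the staleness problem described above, here is a minimal toy sketch (class and method names are hypothetical, not the actual Celeborn code): writers consult a usable-space snapshot that is only refreshed on each heartbeat, so every byte flushed between heartbeats is invisible to the HARD_SPLIT check.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of asynchronous disk accounting.
// Writers check a snapshot refreshed only on heartbeat, so the
// HARD_SPLIT decision can be based on badly stale data.
public class StaleUsageSnapshot {
    private volatile long usableSpaceSnapshot;       // refreshed each heartbeat
    private final AtomicLong writtenSinceHeartbeat = new AtomicLong(0);

    public StaleUsageSnapshot(long initialUsable) {
        this.usableSpaceSnapshot = initialUsable;
    }

    /** Writers consult only the snapshot, not bytes written since the last heartbeat. */
    public boolean shouldHardSplit(long minReserve) {
        return usableSpaceSnapshot < minReserve;
    }

    /** A flush lands on disk, but the snapshot is unchanged until the next heartbeat. */
    public void write(long bytes) {
        writtenSinceHeartbeat.addAndGet(bytes);
    }

    /** Heartbeat thread reconciles the snapshot, possibly many seconds late. */
    public void onHeartbeat() {
        usableSpaceSnapshot -= writtenSinceHeartbeat.getAndSet(0);
    }
}
```

Between heartbeats, `shouldHardSplit` keeps returning `false` even though the disk may already be past the limit, which is exactly the window this PR closes.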
This change proposes a more real-time, coordinated space acquisition during flush, making the disk-full detection foolproof and preventing any spillage beyond the configured limits.
Additionally, the extra disk space used while sorting partition files currently isn't accounted for at all. This PR also changes the logic to account for disk space used / reclaimed during the file sorting process.
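The coordinated acquisition could be sketched roughly as follows (a simplified, hypothetical sketch, not the actual Celeborn implementation): an `AtomicLong` tracks reserved bytes, flush and sorter threads CAS-reserve before writing, and a failed reservation signals HARD_SPLIT. This is also the source of the CAS contention mentioned in the review discussion.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of strict disk reservation. Flush/sorter threads
// reserve space via CAS before writing and release it when space is
// reclaimed, so usage can never silently exceed the configured limit.
public class StrictDiskReserver {
    private final long capacityBytes;                 // configured usable limit
    private final AtomicLong reservedBytes = new AtomicLong(0);

    public StrictDiskReserver(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Try to reserve bytes before a flush; false would signal HARD_SPLIT. */
    public boolean tryReserve(long bytes) {
        while (true) {
            long cur = reservedBytes.get();
            if (cur + bytes > capacityBytes) {
                return false;                          // would exceed the limit
            }
            if (reservedBytes.compareAndSet(cur, cur + bytes)) {
                return true;
            }
            // CAS lost to a concurrent flush/sorter thread; retry.
        }
    }

    /** Release reclaimed space, e.g. after sorting deletes the original file. */
    public void release(long bytes) {
        reservedBytes.addAndGet(-bytes);
    }

    public long reserved() {
        return reservedBytes.get();
    }
}
```

The retry loop is where contention shows up under many concurrent flushers, which is why the feature is gated behind a config and off by default.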
Everything is behind a new config, `celeborn.worker.disk.storage.strictReserve.enabled`, which currently defaults to `false`.
Why are the changes needed?
Disk-full detection is not foolproof.
Does this PR resolve a correctness bug?
No
Does this PR introduce any user-facing change?
No
How was this patch tested?
UTs added