
feat: Add mergeBuffer/maxSpillProximity metric for groupBy spill diagnosis #19627

Open
aho135 wants to merge 1 commit into apache:master from aho135:mergeBuffer-metrics


@aho135 aho135 commented Jun 24, 2026

Contributor

Description

When a groupBy query runs, ConcurrentGrouper divides the single acquired merge buffer into druid.processing.numThreads equal slices (sliceSize = capacity / numThreads) and gives one slice to each processing thread. A query spills to disk as soon as its fullest single slice fills — at roughly sizeBytes / numThreads, which can be far below the configured druid.processing.buffer.sizeBytes.

The existing metrics do not let an operator see this. mergeBuffer/maxBytesUsed is a per-query sum across slices, further discounted by the hash-table load factor, so it never approaches sizeBytes even while queries are actively spilling — making it impossible to compare against druid.processing.buffer.sizeBytes or to reason about spill pressure.

Concretely, an operator with sizeBytes = 125 MiB and numThreads = 240 (slices ≈ 533 KiB, i.e. 546,133 bytes each) saw groupBy/spilledQueries climbing while mergeBuffer/maxBytesUsed sat at about 60 MB, which looks contradictory until you account for slicing.
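To make the arithmetic in this example easier to check, here is a minimal Java sketch of the slice-size calculation. The class and method names (SliceMath, sliceSizeBytes, spillThresholdBytes) are hypothetical and are not part of this PR or of Druid. It assumes the spill point can be approximated in bytes as sliceSize × maxLoadFactor, with the default load factor of 0.7 mentioned in the description.

```java
// Illustrative only: SliceMath is a hypothetical helper, not Druid code.
final class SliceMath {
    // Each processing thread gets an equal share of the single merge buffer.
    static long sliceSizeBytes(long bufferSizeBytes, int numThreads) {
        return bufferSizeBytes / numThreads;
    }

    // Assumption: the byte-level spill point is approximated as
    // sliceSize * maxLoadFactor.
    static long spillThresholdBytes(long sliceSizeBytes, double maxLoadFactor) {
        return (long) (sliceSizeBytes * maxLoadFactor);
    }

    public static void main(String[] args) {
        long sizeBytes = 125L * 1024 * 1024; // druid.processing.buffer.sizeBytes = 125 MiB
        int numThreads = 240;                // druid.processing.numThreads

        long slice = sliceSizeBytes(sizeBytes, numThreads); // 546133 bytes
        long threshold = spillThresholdBytes(slice, 0.7);

        // Upper bound on the summed metric if every slice sat exactly at its threshold.
        long maxSummedBytes = threshold * numThreads;

        System.out.printf("slice size:        %d bytes (%.1f KiB)%n", slice, slice / 1024.0);
        System.out.printf("spill threshold:   %d bytes%n", threshold);
        System.out.printf("max summed usage:  %d bytes%n", maxSummedBytes);
    }
}
```

Under these assumptions, a single slice can hit its spill point after only a few hundred kilobytes, while the per-query sum stays well below sizeBytes even in the extreme case where every slice is full.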

Change

This PR adds mergeBuffer/maxSpillProximity, a dimensionless gauge in [0.0, 1.0]:

maxSpillProximity = maxSliceUsedBytes / (sliceSize × maxLoadFactor),  clamped to [0, 1]
                    ↑ MAX across a query's slices, then MAX across queries
  • It is computed per-slice and maxed (never summed), because the fullest slice is what actually triggers a spill.
  • The denominator is sliceSize × maxLoadFactor (default load factor 0.7), because a BufferHashGrouper spills when its bucket count reaches the load factor, not when the slice is byte-full. This makes 1.0 correspond to the real spill point.
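The formula and the two bullets above can be sketched in Java as follows. This is an illustration of the stated definition only, not the code in this PR: SpillProximitySketch and its methods are hypothetical names, and the input is a plain array of per-slice used bytes for each query, which is an assumption made to keep the example self-contained.

```java
// Illustrative only: a standalone rendering of the metric definition.
final class SpillProximitySketch {
    // Proximity of one slice to its spill point, clamped to [0, 1].
    static double sliceProximity(long usedBytes, long sliceSizeBytes, double maxLoadFactor) {
        double threshold = sliceSizeBytes * maxLoadFactor;
        if (threshold <= 0) {
            return 0.0; // avoid dividing by zero for degenerate inputs
        }
        return Math.max(0.0, Math.min(1.0, usedBytes / threshold));
    }

    // MAX across a query's slices, then MAX across queries (never a sum).
    static double maxSpillProximity(long[][] usedBytesPerQuery, long sliceSizeBytes, double maxLoadFactor) {
        double max = 0.0;
        for (long[] slices : usedBytesPerQuery) {
            for (long usedBytes : slices) {
                max = Math.max(max, sliceProximity(usedBytes, sliceSizeBytes, maxLoadFactor));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Two queries with two slices each; slice size 1000 bytes, load factor 0.5,
        // so each slice spills at about 500 used bytes.
        long[][] usedBytesPerQuery = { {100L, 250L}, {50L, 400L} };
        System.out.println(maxSpillProximity(usedBytesPerQuery, 1000L, 0.5));
    }
}
```

In this toy input the fullest slice holds 400 of its 500 threshold bytes, so the gauge reports roughly 0.8, while the summed usage of 800 bytes across four slices says little about how close any one slice is to spilling.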

Operators can read mergeBuffer/maxSpillProximity alongside groupBy/spilledQueries: a value near 1.0 means slices are saturating, and the fix is to widen each slice by raising druid.processing.buffer.sizeBytes or lowering druid.processing.numThreads.

Changed files

  • GroupByStatsProvider — track per-slice max used bytes and the per-slice spill threshold; add getSpillProximity() (clamped to [0,1]); aggregate as a max across queries.
  • SpillingGrouper — report each slice's peak usage against its spill threshold in close().
  • BufferHashGrouper — expose resolveMaxLoadFactor() so the metric denominator matches the grouper's actual spill decision (including the default-resolution rule).
  • GroupByStatsMonitor — emit mergeBuffer/maxSpillProximity.
  • docs/operations/metrics.md — document the new metric and clarify the slicing semantics of mergeBuffer/bytesUsed and mergeBuffer/maxBytesUsed.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description. (New metric mergeBuffer/maxSpillProximity; no behavior or config changes.)
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests. (N/A — metric is covered by unit tests.)
  • been tested in a test Druid cluster.

…nosis

The merge buffer is sliced into druid.processing.numThreads slices by
ConcurrentGrouper, and a groupBy query spills as soon as its single fullest
slice fills (~ sizeBytes/numThreads). Existing metrics could not explain this:
mergeBuffer/maxBytesUsed is a per-query SUM across slices, discounted by the
hash-table load factor, so it never approaches sizeBytes even while queries
spill, making it impossible to compare against druid.processing.buffer.sizeBytes.

Add mergeBuffer/maxSpillProximity, a dimensionless gauge in [0.0, 1.0]:
the fullest single slice's used bytes divided by its spill threshold
(sliceSize * maxLoadFactor), tracked as a max across slices and across queries.
1.0 means a query reached the point at which a slice spills to disk.

- GroupByStatsProvider: track per-slice max used bytes and spill threshold;
  expose getSpillProximity() (clamped to [0,1]); aggregate as a max.
- SpillingGrouper: report each slice's peak usage against its threshold.
- BufferHashGrouper: expose resolveMaxLoadFactor() so the denominator matches
  the grouper's actual spill decision.
- GroupByStatsMonitor: emit mergeBuffer/maxSpillProximity.
- Clarify mergeBuffer/bytesUsed and maxBytesUsed docs (slicing semantics).

Existing emitted metric names and values are unchanged.
@aho135 aho135 changed the title from "Add mergeBuffer/maxSpillProximity metric for groupBy spill diagnosis" to "feat: Add mergeBuffer/maxSpillProximity metric for groupBy spill diagnosis" Jun 24, 2026
@aho135 aho135 requested a review from GWphua June 24, 2026 18:23

aho135 commented Jun 24, 2026

Contributor Author

Hey @GWphua! We were using the mergeBuffer/maxBytesUsed metric for tuning mergeBuffer size but found that it wasn't a very accurate indicator for spilling. Let me know if you have any thoughts on this. Thanks!

@FrankChen021 FrankChen021 left a comment

Member


Severity  Findings
P0        0
P1        0
P2        1
P3        0
Total     1

Reviewed 7 of 7 changed files.


This is an automated review by Codex GPT-5.5

{
maxMergeBufferUsedBytes.addAndGet(bytes);
maxSliceUsedBytes.accumulateAndGet(usedBytes, Math::max);
sliceSpillThresholdBytes.accumulateAndGet(spillThresholdBytes, Math::max);

Member


[P2] Track spill proximity as a per-slice ratio

This stores the maximum used bytes and maximum threshold independently, which breaks when one query reports groupers with different thresholds. Nested/subtotal processing can pass the same PerQueryStats through a sliced ConcurrentGrouper and later full-buffer SpillingGroupers; if a small slice reaches its spill threshold, the larger full-buffer threshold can be retained here and getSpillProximity() will divide by that larger value, under-reporting the slice spill as roughly 1 / concurrencyHint. Please track the maximum usedBytes / spillThresholdBytes per sliceUsage call, or otherwise keep the used/threshold pair together, so mixed grouper sizes still report the true max proximity.
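The under-reporting described in this comment can be reproduced with a small standalone Java sketch. It is not the PR's GroupByStatsProvider: ProximityTracking and both getter names are hypothetical, and the numbers (a slice threshold of 100 bytes, a full-buffer threshold of 800 bytes, i.e. a concurrency hint of 8) are made up for illustration. The class tracks both strategies side by side: independent maxima of used bytes and threshold, and the maximum of the per-call ratio.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.DoubleAccumulator;

// Illustrative only: contrasts the two ways of aggregating spill proximity.
final class ProximityTracking {
    // Strategy A: keep the largest used bytes and the largest threshold separately.
    private final AtomicLong maxSliceUsedBytes = new AtomicLong();
    private final AtomicLong maxSpillThresholdBytes = new AtomicLong();

    // Strategy B: keep the largest used/threshold ratio seen in any single call.
    private final DoubleAccumulator maxRatio = new DoubleAccumulator(Math::max, 0.0);

    void sliceUsage(long usedBytes, long spillThresholdBytes) {
        maxSliceUsedBytes.accumulateAndGet(usedBytes, Math::max);
        maxSpillThresholdBytes.accumulateAndGet(spillThresholdBytes, Math::max);
        if (spillThresholdBytes > 0) {
            double ratio = (double) usedBytes / spillThresholdBytes;
            maxRatio.accumulate(Math.max(0.0, Math.min(1.0, ratio)));
        }
    }

    double proximityFromIndependentMaxima() {
        long threshold = maxSpillThresholdBytes.get();
        if (threshold <= 0) {
            return 0.0;
        }
        return Math.max(0.0, Math.min(1.0, (double) maxSliceUsedBytes.get() / threshold));
    }

    double proximityFromPairedRatio() {
        return maxRatio.get();
    }

    public static void main(String[] args) {
        ProximityTracking stats = new ProximityTracking();
        stats.sliceUsage(100L, 100L); // small slice reaches its spill threshold
        stats.sliceUsage(50L, 800L);  // later full-buffer grouper, lightly used

        System.out.println("independent maxima: " + stats.proximityFromIndependentMaxima());
        System.out.println("paired ratio:       " + stats.proximityFromPairedRatio());
    }
}
```

With these inputs the independent-maxima strategy divides 100 by 800 and reports 0.125, which is 1 / 8, even though one slice actually reached its spill point; keeping the pair together reports 1.0.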
