feat: introduce processing rate to max ratio as poll-idle-ratio replacement by Fly-Style · Pull Request #19622 · apache/druid

Fly-Style · 2026-06-23T20:57:25Z

Summary

Kafka's poll-idle-ratio is not the best metric for representing idle cost in the autoscaler: it reflects time spent polling, not whether the task has spare processing capacity. It is also not supported for Kinesis, which is still used in real-world production deployments.

This patch adds an alternative idle signal, 1 - (avgProcessingRate / maxObservedRate), gated behind a new opt-in flag useUtilizationRatio (default false, so existing deployments are unaffected).

CostBasedAutoScaler: tracks a bounded watermark of the task's best-observed processing rate (maxObservedRate), feeding CostMetrics.
WeightedCostFunction: when the flag is on, derives idle ratio from that utilization ratio instead of pollIdleRatio; falls back to IDEAL_IDLE_RATIO until a watermark sample exists (cold start).
CostBasedAutoScalerConfig: new useUtilizationRatio boolean, wired through builder/serde/equals/hashCode.
CostMetrics: new nullable maxObservedRate field; old constructor kept (delegates with null) since three test call sites still use it.
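The bounded watermark described above can be sketched roughly as follows. This is an illustration only, not the PR's code: the class name RateWatermark and its methods are hypothetical, and a plain ArrayDeque stands in for the Guava EvictingQueue that the diff uses. It keeps the most recent N rate samples and reports their maximum, returning null until a sample exists (the cold-start case).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not from the PR): bounded window of rate samples with a max lookup. */
final class RateWatermark
{
  private final int capacity;
  private final Deque<Double> samples = new ArrayDeque<>();

  RateWatermark(int capacity)
  {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  /** Records a sample, evicting the oldest one once the window is full. */
  void record(double rate)
  {
    if (Double.isNaN(rate) || rate <= 0) {
      return; // assumption: non-positive or NaN rates carry no information about the ceiling
    }
    if (samples.size() == capacity) {
      samples.removeFirst();
    }
    samples.addLast(rate);
  }

  /** Returns the largest retained sample, or null if nothing has been recorded yet. */
  Double maxObservedRate()
  {
    Double max = null;
    for (double sample : samples) {
      if (max == null || sample > max) {
        max = sample;
      }
    }
    return max;
  }
}
```

Because the window is bounded, an old peak eventually ages out, so the watermark can follow a task whose achievable rate has dropped.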

This PR has:

been self-reviewed.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.

FrankChen021

Severity	Findings
P0	0
P1	0
P2	1
P3	0
Total	1

Reviewed 8 of 8 changed files.

This is an automated review by Codex GPT-5.5

FrankChen021 · 2026-06-24T12:14:30Z

+
+    final int lowInitialTaskCount = 1;
+    // This ensures tasks are busy processing (low idle ratio)
+    Executors.newSingleThreadExecutor().submit(() -> {


[P2] Shut down the background publisher executor

The new test creates a single-thread ExecutorService and immediately drops it. That executor uses a non-daemon worker and is never shut down, so the embedded-test JVM can stay alive after the test returns; on timeout or failure it can also keep publishing into the topic while cleanup is running. Keep a reference and shut it down/cancel the Future in a finally block or use an existing managed executor.
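One way to apply this suggestion is sketched below, using only java.util.concurrent. The class and method names are hypothetical and the loop body is a stand-in for the real publishing code; the point is that the Future and the executor are both kept in local variables and released in a finally block.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical example (not from the PR): a background publisher that is always cleaned up. */
final class BackgroundPublisherExample
{
  /**
   * Runs testBody while a background loop increments the counter (a stand-in for publishing).
   * Returns true if the executor terminated within the timeout.
   */
  static boolean runWithPublisher(Runnable testBody, AtomicInteger published)
  {
    final ExecutorService executor = Executors.newSingleThreadExecutor();
    final Future<?> publisher = executor.submit(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        published.incrementAndGet();
        try {
          Thread.sleep(1);
        }
        catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // restore the flag so the loop exits
        }
      }
    });
    try {
      testBody.run();
    }
    finally {
      publisher.cancel(true);  // interrupt the publishing loop
      executor.shutdownNow();  // release the non-daemon worker thread
    }
    try {
      return executor.awaitTermination(5, TimeUnit.SECONDS);
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

The cleanup runs even if the test body throws, so a failing or timed-out test does not leave a thread publishing in the background.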

kfaraz

Conceptually, this makes sense. Haven't reviewed the tests yet.

kfaraz · 2026-06-25T08:01:52Z

+   * bottleneck, so the rate observed is the task's own ceiling. The max of these samples
+   * becomes {@link CostMetrics#getMaxObservedRate()}.
+   */
+  private final EvictingQueue<Double> lagGatedRateSamples;


Suggested change

-  private final EvictingQueue<Double> lagGatedRateSamples;
+  private final EvictingQueue<Double> lagGatedProcessingRateSamples;

Renamed it differently because the circumstances changed.

kfaraz · 2026-06-25T08:15:18Z

+  /**
+   * Derives the current idle ratio from measured utilization ({@code avgProcessingRate / maxObservedRate}).
+   */
+  static double estimateIdleFromUtilization(CostMetrics metrics)


Maybe this method should exist in CostMetrics itself?

Please rename this method to estimateIdleRatioFromProcessingRate() and move it to the class CostMetrics.

kfaraz · 2026-06-25T08:18:38Z

+  {
+    final Double maxObservedRate = metrics.getMaxObservedRate();
+    if (maxObservedRate == null || maxObservedRate <= 0) {
+      return IDEAL_IDLE_RATIO;


So, if there has been no lag for a while, we will assume that we are already in the optimal usage band?
This would mean that scale down would not happen when, ideally, it should.

I wonder if it would be better to collect samples of avgProcessingRate even when there is no lag, and simply use the max to determine the max observed rate.

This is a tricky area, but after brief consideration I agree with you. Reworking that part.

Thanks! We can tune it further based on how it performs in real clusters.

kfaraz

One final non-blocking comment. Rest looks good.

kfaraz · 2026-06-25T11:39:50Z

+  /**
+   * Derives the current idle ratio from measured utilization ({@code avgProcessingRate / maxObservedRate}).
+   */
+  static double estimateIdleFromUtilization(CostMetrics metrics)


Please rename this method to estimateIdleRatioFromProcessingRate() and move it to the class CostMetrics.

FrankChen021

Severity	Findings
P0	0
P1	0
P2	1
P3	0
Total	1

Reviewed 8 of 8 changed files.

This is an automated review by Codex GPT-5.5

FrankChen021 · 2026-06-25T12:17:04Z

+  static double estimateIdleFromUtilization(CostMetrics metrics)
+  {
+    final Double maxObservedRate = metrics.getMaxObservedRate();
+    final double utilization = Math.min(1.0, metrics.getAvgProcessingRate() / maxObservedRate);


[P2] Restore fallback before dividing by maxObservedRate

When usePollIdleRatio is false, this method derives idle ratio from maxObservedRate, but CostMetrics still allows that value to be null and the added tests explicitly cover null and zero watermarks. Dividing by the boxed value now throws on null, and a zero watermark yields Infinity/NaN behavior instead of the intended IDEAL_IDLE_RATIO cold-start fallback. Please keep the maxObservedRate == null || maxObservedRate <= 0 guard before this division.
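A minimal sketch of the guarded computation the comment asks for is shown below. The class name, the method signature, and the value given to IDEAL_IDLE_RATIO are placeholders chosen for illustration; the real constant and method live in the PR and are not reproduced here.

```java
/** Hypothetical sketch (not the PR's code) of the guard-then-divide order. */
final class IdleRatioSketch
{
  // Placeholder value for illustration; the actual constant is defined in the PR.
  static final double IDEAL_IDLE_RATIO = 0.5;

  static double estimateIdleRatio(double avgProcessingRate, Double maxObservedRate)
  {
    // Cold start: without a usable watermark there is nothing to divide by.
    // Unboxing a null Double would throw NullPointerException, and dividing
    // by zero would produce Infinity or NaN instead of a ratio.
    if (maxObservedRate == null || maxObservedRate <= 0) {
      return IDEAL_IDLE_RATIO;
    }
    // Cap utilization at 1.0 so the idle ratio never goes negative.
    final double utilization = Math.min(1.0, avgProcessingRate / maxObservedRate);
    return 1.0 - utilization;
  }
}
```

With a watermark of 100 and an average rate of 25, this returns 0.75; with a null or zero watermark it returns the fallback instead of throwing.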

Introduce processing rate to max ratio as poll-idle-ratio replacement

401e9fc

github-actions Bot added the Area - Ingestion label Jun 23, 2026

Fly-Style changed the title ~~Introduce processing rate to max ratio as poll-idle-ratio replacement~~ feat: introduce processing rate to max ratio as poll-idle-ratio replacement Jun 23, 2026

Add integration test

c9c833a

Fly-Style requested a review from kfaraz June 24, 2026 08:45

FrankChen021 reviewed Jun 24, 2026

Cleanup

c16aee0

kfaraz reviewed Jun 25, 2026

Adress review comments

1748bb9

Fly-Style requested a review from kfaraz June 25, 2026 10:11

kfaraz approved these changes Jun 25, 2026

FrankChen021 reviewed Jun 25, 2026

Fly-Style added 2 commits June 25, 2026 15:36

Cleanup

a98cfe1

Merge branch 'master' into feat/poll-idle-ratio-replacement

99c8594

3 participants