Add circuit breaking mechanisms for offset auto-reset backfill #18501

Open: rseetham wants to merge 1 commit into apache:master from rseetham:add-backfill-circuit-breaking

@rseetham
Contributor

Introduces four independent circuit breakers to prevent unbounded backfill triggering when a cluster is overwhelmed or restarts after prolonged downtime:

  1. Pause flag per topic (realtime.segment.offsetAutoReset.pause): operator-set boolean in stream config; checked in computeStartOffset() before any backfill decision is made.

  2. Max segments guard (realtime.segment.offsetAutoReset.maxSegmentsBeforeBackfillSkip): skips the backfill trigger when the table's segment count is >= the configured limit, preventing znode exhaustion when ingestion is permanently elevated.

  3. Max concurrent backfills per controller (controller.realtime.offsetAutoReset.maxConcurrentBackfillsPerController): caps the number of tables that can simultaneously backfill on a single controller instance, guarding against cluster-restart storms.

  4. Per-partition in-flight collision threshold (controller.realtime.offsetAutoReset.maxBackfillCollisionsBeforeAutoPause, default 3): tracks consecutive backfill-trigger attempts on a partition that already has an active backfill. Below the threshold, the new trigger is allowed; at or above the threshold, the topic's pause flag is set automatically and a metric is emitted to signal that operator intervention is required.
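The collision threshold in item 4 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `BackfillCollisionTracker`, its method name, and the choice to reset the counter whenever a trigger finds no in-flight backfill (one reading of "consecutive") are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not a class from the PR: counts consecutive trigger
// attempts that collide with an in-flight backfill, per partition.
final class BackfillCollisionTracker {
  private final int maxCollisions;
  private final Map<Integer, Integer> consecutiveCollisions = new ConcurrentHashMap<>();

  BackfillCollisionTracker(int maxCollisions) {
    this.maxCollisions = maxCollisions;
  }

  /**
   * Returns true if the new trigger is allowed. Returns false once the
   * collision count reaches the threshold, which is where a caller would set
   * the topic's pause flag and emit a metric.
   */
  boolean onTriggerAttempt(int partitionId, boolean backfillInFlight) {
    if (!backfillInFlight) {
      // Assumption: a non-colliding attempt breaks the "consecutive" streak.
      consecutiveCollisions.remove(partitionId);
      return true;
    }
    int count = consecutiveCollisions.merge(partitionId, 1, Integer::sum);
    return count < maxCollisions;
  }
}
```

With the default threshold of 3, the first two colliding attempts on a partition are allowed and the third returns false.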

New ControllerMeter entries are added for each skipped-backfill scenario to enable alerting on all circuit breaker activations.
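As a rough sketch of how the first three checks and their per-scenario counters might fit together, the example below evaluates the breakers in order and counts each skip reason. Everything here is hypothetical: `BackfillGate`, the `Decision` names, and the plain in-memory counters stand in for the real ControllerMeter entries, and the ordering after the pause check is an assumption (the description only states that the pause flag is checked first).

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical gate, not a class from the PR: applies the pause, max-segments,
// and max-concurrent checks and counts every skipped trigger by reason.
final class BackfillGate {
  enum Decision { ALLOW, SKIP_PAUSED, SKIP_MAX_SEGMENTS, SKIP_MAX_CONCURRENT }

  // Stand-in for meter entries: one counter per skip reason.
  private final Map<Decision, Long> skipCounts = new EnumMap<>(Decision.class);

  Decision evaluate(boolean paused, int segmentCount, int maxSegments,
      int activeBackfills, int maxConcurrentBackfills) {
    Decision decision;
    if (paused) {
      // The pause flag is checked before any other backfill decision.
      decision = Decision.SKIP_PAUSED;
    } else if (segmentCount >= maxSegments) {
      decision = Decision.SKIP_MAX_SEGMENTS;
    } else if (activeBackfills >= maxConcurrentBackfills) {
      // Assumption: the cap is reached when the active count equals the limit.
      decision = Decision.SKIP_MAX_CONCURRENT;
    } else {
      decision = Decision.ALLOW;
    }
    if (decision != Decision.ALLOW) {
      skipCounts.merge(decision, 1L, Long::sum);
    }
    return decision;
  }

  long skipCount(Decision decision) {
    return skipCounts.getOrDefault(decision, 0L);
  }
}
```

Keeping one counter per skip reason means an alert can distinguish an operator-set pause from a segment-count or concurrency limit being hit.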

Fixes: #18314

bugfix

@codecov-commenter
codecov-commenter commented May 14, 2026

Codecov Report

❌ Patch coverage is 36.36364% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.84%. Comparing base (1a313c3) to head (d9824bf).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...java/org/apache/pinot/spi/stream/StreamConfig.java | 0.00% | 7 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (1a313c3) and HEAD (d9824bf).

HEAD has 4 uploads less than BASE

| Flag | BASE (1a313c3) | HEAD (d9824bf) |
|---|---|---|
| java-21 | 5 | 4 |
| unittests | 2 | 1 |
| temurin | 5 | 4 |
| unittests2 | 1 | 0 |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18501      +/-   ##
============================================
- Coverage     63.68%   55.84%   -7.85%     
+ Complexity     1684      821     -863     
============================================
  Files          3266     2558     -708     
  Lines        199836   148304   -51532     
  Branches      31023    23951    -7072     
============================================
- Hits         127272    82821   -44451     
+ Misses        62424    58408    -4016     
+ Partials      10140     7075    -3065     

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 55.84% <36.36%> (-7.85%) ⬇️ |
| temurin | 55.84% <36.36%> (-7.85%) ⬇️ |
| unittests | 55.84% <36.36%> (-7.85%) ⬇️ |
| unittests1 | 55.84% <36.36%> (+<0.01%) ⬆️ |
| unittests2 | ? |

Development

Successfully merging this pull request may close these issues.

Add Backfill Circuit Breaking for Offset Reset Feature
