HBASE-29555 TestRollbackSCP.testFailAndRollback fails sometimes because non empty Scheduler queue #8357

Open
SwaraliJoshi wants to merge 1 commit into apache:master from SwaraliJoshi:HBASE-29555

Conversation

@SwaraliJoshi

Summary

TestRollbackSCP.testFailAndRollback is flaky: it intermittently fails with
java.lang.IllegalArgumentException: scheduler queue not empty from
ProcedureExecutor.load() while restarting the master procedure executor.

The test restarts the ProcedureExecutor in place, reusing the same
executor and MasterProcedureScheduler instances to simulate a failover. While
the executor is being reloaded, other still-running threads of the live
mini-cluster master can push a procedure back into the shared scheduler in the
small window between scheduler.clear() and ProcedureExecutor.load()'s
Preconditions.checkArgument(scheduler.size() == 0, ...).
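
The window described above can be illustrated with a small, deterministic sketch. It does not use the real HBase classes: `ToyScheduler`, `RestartWindowDemo`, and the `lateProducer` hook are hypothetical stand-ins, and the producer is run inline at the exact point where, in the real test, another thread might happen to run.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy stand-in for the shared scheduler; NOT the real MasterProcedureScheduler. */
class ToyScheduler {
  private final Deque<Long> queue = new ArrayDeque<>();

  synchronized void addFront(long procId) { queue.addFirst(procId); }
  synchronized void clear() { queue.clear(); }
  synchronized int size() { return queue.size(); }
}

class RestartWindowDemo {
  /** Mirrors the shape of the precondition in load(): reject a non-empty queue. */
  static void load(ToyScheduler scheduler) {
    if (scheduler.size() != 0) {
      throw new IllegalArgumentException("scheduler queue not empty");
    }
  }

  /**
   * Simulates an in-place restart. The lateProducer hook (hypothetical) runs
   * between clear() and load(), which is where a still-running thread could
   * wake a procedure. Returns the failure message, or null if load() succeeded.
   */
  static String restart(ToyScheduler scheduler, Runnable lateProducer) {
    scheduler.clear();
    lateProducer.run();
    try {
      load(scheduler);
      return null;
    } catch (IllegalArgumentException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    ToyScheduler scheduler = new ToyScheduler();
    // No producer in the window: load() succeeds.
    System.out.println(restart(scheduler, () -> { }));
    // A producer re-queues a procedure in the window: load() is rejected.
    System.out.println(restart(scheduler, () -> scheduler.addFront(42L)));
  }
}
```

In the real test the producer runs on a different thread, so whether it lands inside the window depends on timing, which is why the failure is intermittent.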

Two producers were identified:

  • the asyncTaskExecutor callback that wakes a procedure after an async meta
    update (e.g. AssignmentManager.persistToMeta), and
  • an incoming reportRegionStateTransition RPC from a live region server,
    handled on an RpcServer handler thread, which wakes a procedure through
    ProcedureEvent.wake -> scheduler.addFront.

This is a test-infrastructure issue: a real master failover starts a fresh
process with a fresh executor/scheduler, so the production load() precondition
is not affected. The fix is therefore confined to test code.

Changes (ProcedureTestingUtility.restart(), test-only)

  • Wait for the already shut-down asyncTaskExecutor to fully terminate before
    clearing the scheduler, so that any pending async wake-up callback has
    finished. This deterministically closes the dominant producer, the async
    one.
  • Reload (clear -> procStore.start -> init) in a bounded retry loop,
    retrying only when load() fails specifically with "scheduler queue not
    empty". ProcedureExecutor.stop() is explicitly safe to call after a failed
    init(), so each retry is a clean redo, which makes the reload robust to
    any external producer.
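
The two changes could look roughly like the following sketch. It is not the code from this PR: `BoundedReloadSketch`, the `Reload` interface, `awaitDrained`, `reloadWithRetry`, and the `stopAfterFailure` hook are hypothetical names, and the reload body (clear -> procStore.start -> init) is left to the caller. Only `ExecutorService.awaitTermination` is a real JDK call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedReloadSketch {
  /** Hypothetical hook standing in for clear -> procStore.start -> init. */
  interface Reload {
    void run();
  }

  static final String QUEUE_NOT_EMPTY = "scheduler queue not empty";

  /** Waits for an already shut-down pool to finish its tasks; false on timeout or interrupt. */
  static boolean awaitDrained(ExecutorService pool, long timeoutMs) {
    try {
      return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /**
   * Runs reload up to maxAttempts times and returns the number of attempts used.
   * Only the "scheduler queue not empty" failure is retried; anything else,
   * or exhausting the attempts, rethrows the original exception.
   */
  static int reloadWithRetry(Reload reload, Runnable stopAfterFailure, int maxAttempts) {
    for (int attempt = 1; ; attempt++) {
      try {
        reload.run();
        return attempt;
      } catch (IllegalArgumentException e) {
        boolean retryable = e.getMessage() != null && e.getMessage().contains(QUEUE_NOT_EMPTY);
        if (!retryable || attempt >= maxAttempts) {
          throw e;
        }
        // Assumption: stands in for stopping the executor after a failed init().
        stopAfterFailure.run();
      }
    }
  }

  public static void main(String[] args) {
    // Step 1: let pending async callbacks finish before touching the scheduler.
    ExecutorService asyncTasks = Executors.newSingleThreadExecutor();
    asyncTasks.submit(() -> { });
    asyncTasks.shutdown();
    System.out.println("drained: " + awaitDrained(asyncTasks, 5_000));

    // Step 2: a reload that fails twice with the retryable error, then succeeds.
    AtomicInteger calls = new AtomicInteger();
    int attempts = reloadWithRetry(() -> {
      if (calls.incrementAndGet() < 3) {
        throw new IllegalArgumentException(QUEUE_NOT_EMPTY);
      }
    }, () -> { }, 5);
    System.out.println("attempts: " + attempts);
  }
}
```

Matching on the exception message keeps the retry narrow: an unrelated IllegalArgumentException still fails the test immediately rather than being masked by the loop.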

Test plan

  • Ran TestRollbackSCP 100x consecutively with the fix: 100/100 passed (twice).
  • Control: reverted the fix and reran: reproduced the failure within a few iterations.
  • Regression: ran affected hbase-procedure tests and a representative
    hbase-server subset (TestSCP, TestProcedureAdmin,
    TestTransitRegionStateProcedure, TestCreateTableProcedure): all pass.

…se non empty Scheduler queue

TestRollbackSCP restarts the master ProcedureExecutor in place, reusing the
same executor and MasterProcedureScheduler instances to simulate a failover.
While the executor is being reloaded, other still-running threads of the live
mini-cluster master can push a procedure back into the shared scheduler in the
window between clearing it and ProcedureExecutor.load() asserting that it is
empty, so load() intermittently fails with "scheduler queue not empty".

There are two such producers:
- the asyncTaskExecutor callback that wakes a procedure after an async meta
  update (e.g. persistToMeta), and
- an incoming reportRegionStateTransition RPC from a live region server, handled
  on an RpcServer thread, which wakes a procedure via ProcedureEvent.

Fix ProcedureTestingUtility.restart() (test-only):
- wait for the already shut-down asyncTaskExecutor to fully terminate before
  clearing the scheduler, so any pending wake-up callback has finished; and
- reload (clear -> store start -> init) in a bounded retry loop, retrying only
  when load() fails because the scheduler is not empty. ProcedureExecutor.stop()
  is explicitly safe to call after a failed init(), so this is a clean redo.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Apache9

Apache9 commented Jun 16, 2026

Contributor

I would prefer that we just change the code in TestRollbackSCP.

We have lots of other tests that rely on this restart; changing its logic may hide other bugs that we want these tests to expose.

Thanks.
