HBASE-29555 TestRollbackSCP.testFailAndRollback fails sometimes because non empty Scheduler queue#8357
Open
SwaraliJoshi wants to merge 1 commit into
Open
HBASE-29555 TestRollbackSCP.testFailAndRollback fails sometimes because non empty Scheduler queue#8357SwaraliJoshi wants to merge 1 commit into
SwaraliJoshi wants to merge 1 commit into
Conversation
…se non empty Scheduler queue TestRollbackSCP restarts the master ProcedureExecutor in place, reusing the same executor and MasterProcedureScheduler instances to simulate a failover. While the executor is being reloaded, other still-running threads of the live mini-cluster master can push a procedure back into the shared scheduler in the window between clearing it and ProcedureExecutor.load() asserting that it is empty, so load() intermittently fails with "scheduler queue not empty". There are two such producers: - the asyncTaskExecutor callback that wakes a procedure after an async meta update (e.g. persistToMeta), and - an incoming reportRegionStateTransition RPC from a live region server, handled on an RpcServer thread, which wakes a procedure via ProcedureEvent. Fix ProcedureTestingUtility.restart() (test-only): - wait for the already shutdown asyncTaskExecutor to fully terminate before clearing the scheduler, so any pending wake-up callback has finished; and - reload (clear -> store start -> init) in a bounded retry loop, retrying only when load() fails because the scheduler is not empty. ProcedureExecutor.stop() is explicitly safe to call after a failed init(), so this is a clean redo. Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor
|
I prefer we just change the code the TestRollbackSCP. We have lots of other tests which rely on this restart, changing the logic may hide some other bugs we want to expose in these tests. Thanks. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
TestRollbackSCP.testFailAndRollbackis flaky: it intermittently fails withjava.lang.IllegalArgumentException: scheduler queue not emptyfromProcedureExecutor.load()while restarting the master procedure executor.The test restarts the
ProcedureExecutorin place, reusing the sameexecutor and
MasterProcedureSchedulerinstances to simulate a failover. Whilethe executor is being reloaded, other still-running threads of the live
mini-cluster master can push a procedure back into the shared scheduler in the
small window between
scheduler.clear()andProcedureExecutor.load()'sPreconditions.checkArgument(scheduler.size() == 0, ...).Two producers were identified:
asyncTaskExecutorcallback that wakes a procedure after an async metaupdate (e.g.
AssignmentManager.persistToMeta), andreportRegionStateTransitionRPC from a live region server,handled on an
RpcServerhandler thread, which wakes a procedure throughProcedureEvent.wake->scheduler.addFront.This is a test-infrastructure issue: a real master failover starts a fresh
process with a fresh executor/scheduler, so the production
load()preconditionis not affected. The fix is therefore confined to test code.
Changes (
ProcedureTestingUtility.restart(), test-only)asyncTaskExecutorto fully terminate beforeclearing the scheduler, so any pending async wake-up callback has finished
(closes the dominant, async producer deterministically).
clear->procStore.start->init) in a bounded retry loop,retrying only when
load()fails specifically withscheduler queue not empty.ProcedureExecutor.stop()is explicitly safe to call after a failedinit(), so this is a clean redo and is robust to any external producer.Test plan
TestRollbackSCP100x consecutively with the fix: 100/100 passed (twice).hbase-proceduretests and a representativehbase-serversubset (TestSCP, TestProcedureAdmin,TestTransitRegionStateProcedure, TestCreateTableProcedure): all pass.