
SOLR-18080: Initiate Leader election for ShardTerms #4069

Open

HoustonPutman wants to merge 19 commits into apache:main from HoustonPutman:solr-18080-shard-term-induce-leader-election

Conversation

@HoustonPutman (Contributor):

https://issues.apache.org/jira/browse/SOLR-18080

Whenever the shard terms change such that the leader is no longer the highest term, a leader election should take place, and all non-up-to-date replicas should go into recovery.

```java
lastRecoveryTerm = lastTermDoRecovery.get();
newTerm = terms.getTerm(coreNodeName);
if (lastRecoveryTerm < newTerm) {
  lastTermDoRecovery.set(newTerm);
```
Contributor:

lastTermDoRecovery is set here, but it's possible that recovery is now deferred below because of leader election. Is that right? The old logic set it and then actually did recovery regardless. Reading this, it seems possible that lastTermDoRecovery is set to the new term while actually doing recovery is skipped further down. So, if recovery is skipped, isn't the term this was set to incorrect, given the name?

Contributor (Author):

Yeah, lastTermDoRecovery might be a bad name, but after leader election, recovery is guaranteed for these replicas at this term value. So while recovery is not explicitly being done here, we know that the leader election will do it. lastTermDoRecovery is still technically correct, just assuming the leader election succeeds.
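A minimal, self-contained sketch of the invariant being discussed; `RecoveryTracker` and `markRecovery` are illustrative names, not Solr's actual API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch (not the actual Solr code) of the bookkeeping discussed
// above: lastTermDoRecovery records the highest term for which recovery has
// either been started directly or is guaranteed by a triggered leader election.
class RecoveryTracker {
    private final AtomicLong lastTermDoRecovery = new AtomicLong(-1);

    /**
     * Returns true if this term is new, i.e. recovery (possibly deferred to
     * leader election) is owed for it; false for stale or repeated terms.
     */
    boolean markRecovery(long newTerm) {
        long last = lastTermDoRecovery.get();
        if (last < newTerm) {
            // Record the term even when actual recovery is deferred: the
            // election path guarantees recovery at this term value.
            return lastTermDoRecovery.compareAndSet(last, newTerm);
        }
        return false;
    }

    long current() {
        return lastTermDoRecovery.get();
    }
}

public class Demo {
    public static void main(String[] args) {
        RecoveryTracker t = new RecoveryTracker();
        System.out.println(t.markRecovery(3)); // prints "true": term 3 is new
        System.out.println(t.markRecovery(2)); // prints "false": stale term
        System.out.println(t.current());       // prints "3"
    }
}
```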

```java
return replicasNeedingRecovery.contains(key);
}

public ShardTerms setHighestTerms(Set<String> highestTermKeys) {
```
Contributor:

The whole "term" algorithm makes some pretty strict assumptions about who can update term values, and on what conditions. From class javadocs on ZkShardTerms:

```java
 * <p>Terms can only updated in two strict ways:
 *
 * <ul>
 *   <li>A replica sets its term equals to leader's term
 *   <li>The leader increase its term and some other replicas by 1
 * </ul>
```

This method seems to fit under the latter provision, which is good. But could we add Javadocs here to indicate that this method should only be called by current shard-leaders? Or if this is safe for non-leaders to call in certain situations, add javadocs to describe what those are and why.

Just trying to defend against the possibility of someone coming back through here in a month or two and thinking: "Hey this doesn't fit with the documented algorithm at all"

Contributor (Author):

So this ultimately doesn't happen from the leader, and the leader's term is not guaranteed to be increased. As the new test class shows, leader election can be triggered because of this new API.

I think the easiest thing to do here is insist that the collection is in a read-only state when this API is called. I'm not sure how I'll do that, but it will definitely guard against any issues with missing updates or anything like that.

Contributor (Author):

Ok, I added a param to pass the DocCollection in to make sure that the collection is in a read-only state. This could possibly be improved as an API later, but the functionality is correct.
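A self-contained sketch of such a guard, under the assumption that the read-only flag is a simple collection property; `CollectionState`, `Terms`, and the `readOnly` property name are illustrative stand-ins, not Solr's actual `DocCollection` API:

```java
import java.util.Map;
import java.util.Set;

// Illustrative stand-in for the collection state; Solr's real DocCollection
// API differs. This only models the read-only flag discussed above.
class CollectionState {
    private final Map<String, Object> props;
    CollectionState(Map<String, Object> props) { this.props = props; }
    boolean isReadOnly() { return Boolean.TRUE.equals(props.get("readOnly")); }
}

class Terms {
    /** Refuses to rewrite shard terms unless the collection is read-only. */
    static void setHighestTerms(CollectionState coll, Set<String> highestTermKeys) {
        if (!coll.isReadOnly()) {
            throw new IllegalStateException(
                "Collection must be read-only before shard terms can be rewritten");
        }
        // ... proceed to bump the given cores to the highest term ...
    }
}

public class Demo {
    public static void main(String[] args) {
        CollectionState writable = new CollectionState(Map.of("readOnly", false));
        try {
            Terms.setHighestTerms(writable, Set.of("core_node1"));
        } catch (IllegalStateException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
        // A read-only collection passes the guard.
        Terms.setHighestTerms(new CollectionState(Map.of("readOnly", true)), Set.of("core_node1"));
    }
}
```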

Contributor (Author):

I also updated the documentation to mention this new use of the class, which should not interfere with the other two uses.
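For instance, the updated class documentation could read something like this (the wording here is a suggestion sketched from this thread, not the merged text):

```java
/**
 * Sets the given cores to the highest term in this shard.
 *
 * <p>Unlike the two standard term updates documented on this class, this is
 * not performed by the current shard leader, and the leader's term is not
 * guaranteed to be increased. Callers must first put the collection into a
 * read-only state so that no conflicting term updates can occur; a leader
 * election may be triggered as a result.
 */
public ShardTerms setHighestTerms(Set<String> highestTermKeys) { ... }
```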

Contributor:

> Ok, I added a param to pass the DocCollection in to make sure that the collection is in a read-only state.

Sgtm, thanks @HoustonPutman

```java
mutate(terms -> terms.increaseTerms(leader, replicasNeedingRecovery));
}

public void ensureHighestTerms(Set<String> mostUpToDateCores) {
```
Contributor:

Ditto, re: my previous comment on ShardTerms.setHighestTerms. We should add some Javadocs to make sure this is only called by leaders, or, if it's actually safe to call elsewhere, describe where and why.

```java
state -> {
  Slice shardState = state.getSlice(shard);
  for (Replica r : recoveryReplicas) {
    if (shardState.getReplica(r.name).getState() != Replica.State.RECOVERING) {
```
Contributor:
[Q] Since "recovering" is a transient state and the doc/index size here is very small, is it possible that the replica would go into recovery as expected and waitForState would just miss it based on when it polls?

Contributor (Author):

I believe waitForState uses a ZK watcher, so it should be safe and generally see all of the ZK updates. I haven't had it fail yet, and I've run it at least 100 times. But yeah, there is a concern there, so I might revisit it later. Ultimately, going into recovery is very necessary here, and it didn't always happen before this change.

Alternatively, we can probably just check that the shard terms all become equal in the end and that the docs are there (and that the replicas become active, of course). That would probably be good enough.

Contributor (Author):

Ok, so I changed it to make sure that all given replicas go into recovery at some point, not necessarily at the same time. This should make it a bit more resilient on slower or unusual hardware.
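The relaxed test strategy described above can be sketched as follows; `RecoveryObserver`, the event shape, and the state names are illustrative stand-ins, not the actual test code:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the relaxed check: record each replica the first time it is
// observed RECOVERING, and pass once every replica has been seen at least
// once, even if the replicas never recover simultaneously.
class RecoveryObserver {
    /** Returns true once every replica in 'replicas' was seen RECOVERING at some point. */
    static boolean allRecoveredAtSomePoint(Set<String> replicas, List<String[]> events) {
        Set<String> pending = new HashSet<>(replicas);
        for (String[] e : events) {               // e = {replicaName, state}
            if (e[1].equals("RECOVERING")) pending.remove(e[0]);
            if (pending.isEmpty()) return true;
        }
        return pending.isEmpty();
    }
}

public class Demo {
    public static void main(String[] args) {
        List<String[]> events = List.of(
            new String[] {"replica1", "RECOVERING"},
            new String[] {"replica1", "ACTIVE"},
            // replica2 only recovers after replica1 is already active again
            new String[] {"replica2", "RECOVERING"},
            new String[] {"replica2", "ACTIVE"});
        System.out.println(
            RecoveryObserver.allRecoveredAtSomePoint(Set.of("replica1", "replica2"), events));
        // prints "true": both replicas recovered, just not at the same time
    }
}
```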

```java
return nodeProps.getStr(ReplicaStateProps.CORE_NAME);
}

public String getCoreNodeName() {
```
Contributor:

Even a little example of what this is/means would be helpful.

Contributor (Author):

This is generally the same as the replica name, but coreNodeName is used very widely across the codebase, so I'm not sure javadocs are really necessary here, since none of the other methods have any.

```java
waitForState("Timeout waiting for 1x3 collection", collectionName, clusterShape(1, 3));
assertDocsExistInAllReplicas(Arrays.asList(leader, replica1), collectionName, 1, 3);

try (ZkShardTerms zkShardTerms =
```
Contributor (Author):

This would have actually failed before this PR. So we are fixing a case where zkShardTerms could fail to be set to the leader's term after "recovery" (in this case, peer-sync).

```diff
  private final ReentrantLock recoveryLock = new ReentrantLock();

- private final ActionThrottle recoveryThrottle = new ActionThrottle("recovery", 10000);
+ private final ActionThrottle recoveryThrottle = new ActionThrottle("recovery", 1000);
```
Contributor (Author):

Talked with Mark about this; he isn't against lowering it to 1 second. A 10 second throttle makes any tests that explicitly fail recoveries and then succeed at them hard to write.

@HoustonPutman (Contributor, Author):

I think this should be good to go now...

@gerlowskija (Contributor) left a comment:

Appreciate the changes; looks good to me @HoustonPutman !

