Skip to content

reset pending orchestrations when worker restart#1358

Open
kaibocai wants to merge 1 commit into
mainfrom
kaibocai/reset-pending-item-drain
Open

reset pending orchestrations when worker restart#1358
kaibocai wants to merge 1 commit into
mainfrom
kaibocai/reset-pending-item-drain

Conversation

@kaibocai
Copy link
Copy Markdown
Member

This PR improves partition drain behavior for Azure Storage control queues. When a partition is released, any control queue messages that were already dequeued but not yet dispatched to an active orchestration session are now abandoned with zero visibility timeout, making them immediately visible for the next partition owner.

The change prevents a throughput gap during lease transitions where pending in-memory messages could otherwise remain invisible until their original visibility timeout expired.

Related ICM: https://portal.microsofticm.com/imp/v5/incidents/details/21000001021644/summary

Copilot AI review requested due to automatic review settings May 21, 2026 22:46
Comment on lines +246 to +254
catch (Exception e)
{
this.settings.Logger.PartitionManagerWarning(
this.storageAccountName,
this.settings.TaskHubName,
this.settings.WorkerId,
this.Name,
$"Failed to abandon message {queueMessage.MessageId} during drain: {e.Message}");
}
Comment on lines +259 to +261
catch (OperationCanceledException)
{
}
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Azure Storage partition drain behavior by ensuring control-queue messages that were dequeued but not yet dispatched are re-exposed immediately when a partition is released, avoiding a throughput gap during lease transitions.

Changes:

  • Abandon pending (dequeued-but-undispatched) control-queue messages with zero visibility timeout during DrainAsync.
  • Guard GetNextSessionAsync against ready-queue nodes that were drained/removed from the pending list.
  • Add a unit test intended to validate skipping of drained ready-queue nodes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
src/DurableTask.AzureStorage/OrchestrationSessionManager.cs Abandons pending message batches during drain and skips drained nodes during session dequeue.
src/DurableTask.AzureStorage/Messaging/ControlQueue.cs Adds a drain-specific abandon path that updates message visibility timeout to zero.
Test/DurableTask.AzureStorage.Tests/OrchestrationSessionTests.cs Adds a test for drained ready-queue node handling (but currently under an inactive test directory).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +228 to +235
[TestMethod]
public async Task GetNextSessionAsync_DrainedReadyQueueNode_IsIgnored()
{
var settings = new AzureStorageOrchestrationServiceSettings
{
StorageAccountClientProvider = new StorageAccountClientProvider("UseDevelopmentStorage=true"),
};
var stats = new AzureStorageOrchestrationServiceStats();
Comment on lines +253 to +261
using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(100));
try
{
await manager.GetNextSessionAsync(entitiesOnly: false, cts.Token);
Assert.Fail("Expected cancellation after the drained node was skipped.");
}
catch (OperationCanceledException)
{
}
{
// Remove the partition from memory
// Make dequeued-but-undispatched messages visible before dropping the partition.
await this.AbandonPendingMessagesAsync(partitionId);
this.settings.TaskHubName,
this.settings.WorkerId,
this.Name,
$"Failed to abandon message {queueMessage.MessageId} during drain: {e.Message}");
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants