Skip to content

Refactor CICD label management to use per-PR sequential queue#5035

Open
eyebrowsoffire wants to merge 8 commits into
flutter:mainfrom
eyebrowsoffire:serial_pr_handling
Open

Refactor CICD label management to use per-PR sequential queue#5035
eyebrowsoffire wants to merge 8 commits into
flutter:mainfrom
eyebrowsoffire:serial_pr_handling

Conversation

@eyebrowsoffire

Copy link
Copy Markdown
Contributor

This change overhauls the management of the CICD label in Cocoon to address race conditions and non-idempotent operations that caused issues in the previous attempt.

Key changes:

  • Introduced PullRequestManager to handle events for a single PR sequentially via an operation queue.
  • Moved PR-specific logic from webhook_subscription.dart to PullRequestManager, thinning out the webhook handler.
  • Implemented new state machine logic for opened and synchronize events to handle privileged users and draft PRs correctly.
  • Made _scheduleIfMergeable idempotent by tracking scheduledSha in Firestore to prevent duplicate triggering of presubmits for the same SHA.
  • Refactored PullRequestManager creation to be synchronous in the cache, queueing the asynchronous initialization to avoid creation race conditions.
  • Restored cicd_flowchart.md from a previous commit to document the flow.
  • Fixed all static analysis issues and ensured all tests pass.

This change overhauls the management of the `CICD` label in Cocoon to address race conditions and non-idempotent operations that caused issues in the previous attempt.

Key changes:
- Introduced `PullRequestManager` to handle events for a single PR sequentially via an operation queue.
- Moved PR-specific logic from `webhook_subscription.dart` to `PullRequestManager`, thinning out the webhook handler.
- Implemented new state machine logic for `opened` and `synchronize` events to handle privileged users and draft PRs correctly.
- Made `_scheduleIfMergeable` idempotent by tracking `scheduledSha` in Firestore to prevent duplicate triggering of presubmits for the same SHA.
- Refactored `PullRequestManager` creation to be synchronous in the cache, queueing the asynchronous initialization to avoid creation race conditions.
- Restored `cicd_flowchart.md` from a previous commit to document the flow.
- Fixed all static analysis issues and ensured all tests pass.
@flutter-dashboard flutter-dashboard Bot added the CICD Run CI/CD label Apr 28, 2026
@eyebrowsoffire eyebrowsoffire added CICD Run CI/CD and removed CICD Run CI/CD labels Apr 29, 2026

@jtmcdole jtmcdole left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still working through this, but wanted to post some early comments.

Comment thread app_dart/docs/cicd_flowchart.md
Comment thread app_dart/lib/src/model/firestore/pull_request_state.dart Outdated
Comment thread app_dart/lib/src/model/firestore/pull_request_state.dart Outdated
Comment thread app_dart/lib/src/model/firestore/pull_request_state.dart Outdated
Comment thread app_dart/lib/src/model/firestore/pull_request_state.dart Outdated
Comment thread app_dart/lib/src/service/pull_request_manager.dart Outdated
Comment thread app_dart/lib/src/service/pull_request_manager.dart Outdated
Comment thread app_dart/lib/src/service/pull_request_manager.dart Outdated
Comment thread app_dart/lib/src/service/pull_request_manager.dart Outdated
Comment thread app_dart/lib/src/service/pull_request_manager.dart

@jtmcdole jtmcdole left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really just one change, the UUID get/set in one step. LGTM otherwise

set isPrivileged(bool? value) {
if (value != null) {
fields[kIsPrivilegedField] = value.toValue();
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know when we'd use it, but would you want:

        } else {
          fields.remove(kIsPrivilegedField);
        }

We could do that for kLatestShaField and kScheduledShaField as well - again, not sure when we'd use it, but maybe worth having?

final lockValue = Uint8List.fromList('l'.codeUnits);

// Attempt to acquire lock
final existingLock = await cache.getOrCreate(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to go read getOrCreateWithLocking() to realize it is only "locked" for the local instance, and not across multiple instances.

The dashboard relies on uuid; so you could:

//  uuid: ^4.5.3
final uuid = Uuid();

final value = Uuid.parseAsByteList((uuid.v4()));
final existing = await cache.getOrCreate('pr_locks', lockKey, createFn: () => value, ttl: const Duration(minutes: 5));

/// I DO NOT KNOW IF THIS IS TRUE: set() doesn't have good dartdoc and I don't know if the value that is set is returned.
if (existingLock != value) {
       throw const ServiceUnavailable('PR is locked by another instance'); 
}

// no need to call "set()"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will still hit a snag if two webhooks come in at the same time, even if we use Uuid - both instances will see null, both will then write the uuid, and both will get back the Uuid that they wrote. neat_cache is just really... poor for this.

do we want to consider using package:redis or package:redis_dart_client? They both appear to have "setex()" which is exactly what we're trying to do here? Or do we want to ask the dart team if they will add setex to neat_cache?

@jtmcdole

Copy link
Copy Markdown
Member

Looking at this again right now

@jtmcdole jtmcdole left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still concerned with "locking" atomicity. We could argue that two webhooks won't come in at the same time - but I believe we we get many hooks delivered in quick succession:

Same PR, messages at the same time:

  • aaf18850-6e79-11f1-8fbc-2d43f66bf3b9 pull_request.opened 2026-06-22 13:33:47
  • ab005560-6e79-11f1-84ca-8bb5774e4c64 pull_request.labeled 2026-06-22 13:33:47
SEARCH("`pull_request` 188347")
protoPayload.resource = "/api/github-webhook-pullrequest"

why labeled came in before create? No clue! But it looks like they got handled by different instanceId's

try {
return await action();
} finally {
await cache.purge('pr_locks', lockKey);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If action() takes longer than 5 minutes, the lock is already lost because of the TTL. Another webhook will get the lock and it will have it purged by this call.

Use the UUID I mentioned in a previous comment. Then you can read it here and if it's the same as what you locked above, you can purge it. If not, we should totally log a warning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CICD Run CI/CD

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants