Refactor CICD label management to use per-PR sequential queue#5035
Refactor CICD label management to use per-PR sequential queue#5035eyebrowsoffire wants to merge 8 commits into
Conversation
This change overhauls the management of the `CICD` label in Cocoon to address race conditions and non-idempotent operations that caused issues in the previous attempt. Key changes: - Introduced `PullRequestManager` to handle events for a single PR sequentially via an operation queue. - Moved PR-specific logic from `webhook_subscription.dart` to `PullRequestManager`, thinning out the webhook handler. - Implemented new state machine logic for `opened` and `synchronize` events to handle privileged users and draft PRs correctly. - Made `_scheduleIfMergeable` idempotent by tracking `scheduledSha` in Firestore to prevent duplicate triggering of presubmits for the same SHA. - Refactored `PullRequestManager` creation to be synchronous in the cache, queueing the asynchronous initialization to avoid creation race conditions. - Restored `cicd_flowchart.md` from a previous commit to document the flow. - Fixed all static analysis issues and ensured all tests pass.
jtmcdole
left a comment
There was a problem hiding this comment.
I am still working through this, but wanted to post some early comments.
jtmcdole
left a comment
There was a problem hiding this comment.
Really just one change, the UUID get/set in one step. LGTM otherwise
| set isPrivileged(bool? value) { | ||
| if (value != null) { | ||
| fields[kIsPrivilegedField] = value.toValue(); | ||
| } |
There was a problem hiding this comment.
I don't know when we'd use it, but would you want:
} else {
fields.remove(kIsPrivilegedField);
}We could do that for kLatestShaField and kScheduledShaField as well - again, not sure when we'd use it, but maybe worth having?
| final lockValue = Uint8List.fromList('l'.codeUnits); | ||
|
|
||
| // Attempt to acquire lock | ||
| final existingLock = await cache.getOrCreate( |
There was a problem hiding this comment.
I had to go read getOrCreateWithLocking() to realize it is only "locked" for the local instance, and not across multiple instances.
The dashboard relies on uuid; so you could:
// uuid: ^4.5.3
final uuid = Uuid();
final value = Uuid.parseAsByteList((uuid.v4()));
final existing = await cache.getOrCreate('pr_locks', lockKey, createFn: () => value, ttl: const Duration(minutes: 5));
/// I DO NOT KNOW IF THIS IS TRUE: set() doesn't have good dartdoc and I don't know if the value that is set is returned.
if (existingLock != value) {
throw const ServiceUnavailable('PR is locked by another instance');
}
// no need to call "set()"
There was a problem hiding this comment.
I think this will still hit a snag if two webhooks come in at the same time, even if we use Uuid - both instances will see null, both will then write the uuid, and both will get back the Uuid that they wrote. neat_cache is just really... poor for this.
do we want to consider using package:redis or package:redis_dart_client? They both appear to have "setex()" which is exactly what we're trying to do here? Or do we want to ask the dart team if they will add setex to neat_cache?
|
Looking at this again right now |
jtmcdole
left a comment
There was a problem hiding this comment.
I'm still concerned with "locking" atomicity. We could argue that two webhooks won't come in at the same time - but I believe we we get many hooks delivered in quick succession:
Same PR, messages at the same time:
- aaf18850-6e79-11f1-8fbc-2d43f66bf3b9 pull_request.opened 2026-06-22 13:33:47
- ab005560-6e79-11f1-84ca-8bb5774e4c64 pull_request.labeled 2026-06-22 13:33:47
SEARCH("`pull_request` 188347")
protoPayload.resource = "/api/github-webhook-pullrequest"
why labeled came in before create? No clue! But it looks like they got handled by different instanceId's
| try { | ||
| return await action(); | ||
| } finally { | ||
| await cache.purge('pr_locks', lockKey); |
There was a problem hiding this comment.
If action() takes longer than 5 minutes, the lock is already lost because of the TTL. Another webhook will get the lock and it will have it purged by this call.
Use the UUID I mentioned in a previous comment. Then you can read it here and if it's the same as what you locked above, you can purge it. If not, we should totally log a warning.
This change overhauls the management of the
CICDlabel in Cocoon to address race conditions and non-idempotent operations that caused issues in the previous attempt.Key changes:
PullRequestManagerto handle events for a single PR sequentially via an operation queue.webhook_subscription.darttoPullRequestManager, thinning out the webhook handler.openedandsynchronizeevents to handle privileged users and draft PRs correctly._scheduleIfMergeableidempotent by trackingscheduledShain Firestore to prevent duplicate triggering of presubmits for the same SHA.PullRequestManagercreation to be synchronous in the cache, queueing the asynchronous initialization to avoid creation race conditions.cicd_flowchart.mdfrom a previous commit to document the flow.