Skip to content

[Bug] Failed PR_FILES jobs can block future file refreshes #124

@YB0y

Description

@YB0y

Description

PullRequestHandler and FetchProcessor both enqueue PR_FILES refreshes with a deterministic BullMQ job id per (repo, PR, head SHA, base SHA) tuple:

jobId: prFilesJobId(repoFullName, prNumber, expectedHeadSha, expectedBaseSha),
removeOnComplete: true,
removeOnFail: 50,
attempts: 3,

BullMQ treats custom job ids as unique while an old job with that id still exists; adding a job with the same id is ignored until the old job is removed. Because failed jobs are retained by removeOnFail: 50, a fetch-pr-files job that exhausts retries (e.g. a transient GitHub Files API failure) remains in the failed set and blocks later refreshes for the same SHA tuple.

This is the same defect class fixed for PR metadata jobs in #118 ("failed metadata jobs must not squat on the stable per-PR jobId"). The PR_FILES jobId is equally stable per SHA tuple, so the same reasoning applies.

The most concrete trigger is the base-retarget refresh introduced by #115 (closing #62): when pull_request.edited arrives with changes.base != null, the handler reuses the same headSha/baseSha tuple to re-enqueue PR_FILES. If a prior fetch on that tuple failed, the retarget refresh is silently dropped, and pr_files / scoring_data_stored stay pinned to the old base — defeating the #115 fix.

Steps to Reproduce

  1. Trigger a pull_request.synchronize event that enqueues fetch-pr-files for a tracked PR.
  2. Make that files job fail all retry attempts, for example through a transient GitHub Files API failure.
  3. Observe that the failed job remains retained because the queue options use removeOnFail: 50.
  4. Deliver a later pull_request.edited webhook with changes.base (the base-retarget path from fix(webhook): refresh PR files on pull_request.edited base retarget (#62) #115) for the same PR, where the head SHA is unchanged.
  5. Observe that the handler tries to enqueue the same job id, prFilesJobId(repoFullName, prNumber, headSha, baseSha).
  6. The fresh files refresh can be skipped because the retained failed job still owns that custom job id.

The same scenario reproduces from FetchProcessor.enqueuePrFilesJob (used by the worker-side re-enqueue path).

Expected Behavior

A later pull_request.edited (base retarget) or any other same-SHA re-enqueue should be able to enqueue a fresh files refresh after a previous files job failed.

The mirror should eventually update:

  • pr_files
  • pr_file_contents
  • pull_requests.scoring_data_stored / pull_requests.base_sha consistency after a base retarget

Actual Behavior

A retained failed files job can cause later refresh attempts for the same (repo, PR, head SHA, base SHA) tuple to be ignored by BullMQ. The mirror keeps stale file scoring data until the failed job is manually removed or evicted by the removeOnFail: 50 retention window.

Environment

  • OS: Any
  • Runtime/Node version: Node 20
  • Browser (if applicable): N/A

Additional Context

Affected code:

  • packages/das/src/webhook/handlers/pull-request.handler.ts:121
  • packages/das/src/queue/fetch.processor.ts:213 (inside enqueuePrFilesJob)

Suggested narrow fix: mirror #118's diff — replace removeOnFail: 50 with removeOnFail: true at the two enqueue sites above so failed jobs evict immediately and free the deterministic jobId.

This is distinct from existing issues / PRs:

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions