Skip to content

fix(run-engine): decrement totalWeight in fair-queue weighted env shuffle#4019

Merged
matt-aitken merged 1 commit into
mainfrom
fix/fair-queue-weighted-shuffle-total-weight
Jun 22, 2026
Merged

fix(run-engine): decrement totalWeight in fair-queue weighted env shuffle#4019
matt-aitken merged 1 commit into
mainfrom
fix/fair-queue-weighted-shuffle-total-weight

Conversation

@matt-aitken

Copy link
Copy Markdown
Member

Summary

Fixes the fair-queue weighted environment shuffle, which biased environment ordering whenever fair-queue biases are enabled (the default configuration).

Root cause

#weightedShuffle in fairQueueSelectionStrategy.ts computed the total weight once and drew its random pivot against that full-set total on every iteration, but never decremented the total as items were removed from the working set. After the first pick, the pivot frequently overshot the sum of the remaining items, so the inner selection loop ran off the end and clamped to the last remaining element. The result systematically over-selected whichever environment sat at the tail of the set.

The first slot stayed fair (the full total is correct on the first draw), but later positions were ordered by environment iteration order rather than by the intended concurrency-limit and available-capacity weighting. For four equal-weight environments, the final position landed on one env ~9% of the time and another ~42%, instead of ~25% each.

The two sibling selection paths (#weightedRandomQueueOrder and #selectTopEnvs) already decrement the total before splicing; this brings the env shuffle in line with them.

Fix

result.push(items[index].envId);
totalWeight -= items[index].weight;
items.splice(index, 1);

Adds a regression test that runs the weighted shuffle over equal-weight envs with biases enabled and asserts each env lands in every position roughly uniformly. It fails on the old code (tail position ~37%) and passes with the fix.

Reported in #4001.

…ffle

The weighted env shuffle drew its random pivot against the full set's total
weight on every iteration but never reduced that total as items were removed.
Once items had been picked, the pivot routinely overshot the remaining
weight, the selection loop ran off the end, and the last remaining item was
chosen, over-picking whichever env sat at the tail of the set.

With fair-queue biases enabled (the default), this left the first env slot
fair but skewed later positions by env iteration order instead of by the
intended concurrency and capacity weighting. Decrement totalWeight before
splicing, matching the sibling queue-order and top-env selection paths.
@changeset-bot

changeset-bot Bot commented Jun 22, 2026

Copy link
Copy Markdown

⚠️ No Changeset found

Latest commit: c551ba2

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: bdf78da3-4c98-4040-b772-07d78f1e422c

📥 Commits

Reviewing files that changed from the base of the PR and between bb92935 and c551ba2.

📒 Files selected for processing (2)
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout. (25)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (5, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (7, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (10, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (4, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (6, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (11, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (9, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (12, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (8, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (1, 12)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (3, 12)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 10)
  • GitHub Check: internal / 🧪 Unit Tests: Internal (2, 12)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (10, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (9, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 10)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (7)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Import from @trigger.dev/sdk when writing Trigger.dev tasks. Never use @trigger.dev/sdk/v3 or deprecated client.defineJob

Files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic imports. Only use dynamic import() when circular dependencies cannot be resolved, code splitting is needed for performance, or the module must be loaded conditionally at runtime
Import subpaths only from packages/core (@trigger.dev/core), never import from the root

Files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.test.{ts,tsx}: Never mock anything in tests - use testcontainers instead
Test files should be placed next to source files (e.g., MyService.ts -> MyService.test.ts)

Files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
**/*.{js,ts,tsx,jsx,css,json,md}

📄 CodeRabbit inference engine (AGENTS.md)

Use Prettier for code formatting and run pnpm run format before committing

Files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
**/*.test.{js,ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.test.{js,ts,tsx}: Test files should live beside the files under test and use descriptive describe and it blocks
Use vitest for unit testing
Tests should avoid mocks or stubs and use helpers from @internal/testcontainers when Redis or Postgres are needed

Files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
🧠 Learnings (10)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.

Applied to files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.

Applied to files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
📚 Learning: 2026-06-13T19:53:13.759Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3937
File: packages/trigger-sdk/skills/realtime-and-frontend/SKILL.md:258-260
Timestamp: 2026-06-13T19:53:13.759Z
Learning: When reviewing code that uses `trigger.dev/react-hooks`’s `useRealtimeRun`, preserve the call signature where the first argument is the full realtime handle object (not `handle.id`). This is intentional to maintain type-safety and is consistent with the official docs; do not suggest changing the first argument from the handle object to `handle.id`.

Applied to files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
📚 Learning: 2026-06-17T17:13:49.929Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3948
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.bulk-actions.$bulkActionParam/route.tsx:48-62
Timestamp: 2026-06-17T17:13:49.929Z
Learning: In triggerdotdev/trigger.dev, within `dashboardLoader`/`dashboardAction` (or similar context resolver code) whenever you resolve an organization ID from an organization slug for RBAC/enterprise authorization scope, always read from the primary Prisma client (`prisma`), not `$replica`. Using `$replica` can hit replica-lag and cause the RBAC lookup/authorization to run without the correct org scope (bypassing intended role enforcement). Implement the slug→org lookup with `prisma.organization.findFirst(...)` (or equivalent primary-client query) and add an inline comment documenting why the primary client is required (replica lag could lead to unscoped RBAC checks).

Applied to files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
📚 Learning: 2026-05-18T14:40:02.173Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3658
File: packages/core/src/v3/realtimeStreams/manager.test.ts:1-147
Timestamp: 2026-05-18T14:40:02.173Z
Learning: In the triggerdotdev/trigger.dev repo, the policy “Never mock anything — use testcontainers instead” should only be enforced for integration tests that interact with real external services (e.g., Redis, Postgres) via actual infrastructure. For unit tests that exercise pure in-memory logic (e.g., cache semantics) it is OK to stub collaborators such as `ApiClient` using Vitest (`vi.fn()`) to assert call counts or control behavior. Do not flag `vi.fn()`-based `ApiClient` stubs in unit tests as violations of the testcontainers policy.

Applied to files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
📚 Learning: 2026-06-04T18:16:35.386Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3836
File: apps/supervisor/src/backpressure/backpressureMonitor.ts:3-5
Timestamp: 2026-06-04T18:16:35.386Z
Learning: When reviewing TypeScript in this repo, apply the rule “prefer type aliases over interfaces” only to data/object shapes and union/intersection type modeling. If an interface is being used as a behavioral contract for collaborators to implement (e.g., method-shape interfaces that define required behavior, such as `BackpressureLogger` / `BackpressureSignalSource` in `apps/supervisor/src/backpressure/backpressureMonitor.ts`), keep it as an `interface` and do not flag it as a type-alias-vs-interface violation.

Applied to files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
📚 Learning: 2026-06-09T17:58:04.699Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3879
File: apps/webapp/app/models/vercelIntegration.server.ts:619-630
Timestamp: 2026-06-09T17:58:04.699Z
Learning: In this codebase, outbound raw `fetch` calls should typically rely on Node/undici’s default request timeout (about ~300s) rather than adding a per-call `AbortController` + `setTimeout` wrapper inside individual functions (e.g. in files like `apps/webapp/app/models/vercelIntegration.server.ts`). During code review, do not flag the absence of a per-call timeout on a single `fetch` as an issue; if per-call timeouts are needed, they should be implemented via a codebase-wide convention (e.g., a shared fetch wrapper or documented pattern) rather than ad-hoc per-function changes.

Applied to files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
  • internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts
📚 Learning: 2026-06-16T09:19:47.637Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3960
File: apps/webapp/test/prismaInfrastructureErrorCapture.test.ts:0-0
Timestamp: 2026-06-16T09:19:47.637Z
Learning: In this repo’s Vitest setup, `vitest.config.ts` uses `globals: true`, so identifiers like `vi`, `describe`, `it`, and `expect` are available as globals in Vitest test files. During code review, do not flag missing `vi`/`describe`/`it`/`expect` imports as a runtime error or correctness issue when they’re used in `*.test.ts/tsx` or `*.spec.ts/tsx` files. Explicit imports are still preferred for consistency, but they’re not required for runtime behavior.

Applied to files:

  • internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts
🔇 Additional comments (2)
internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts (1)

212-212: LGTM!

Also applies to: 227-231

internal-packages/run-engine/src/run-queue/tests/fairQueueSelectionStrategy.test.ts (1)

1207-1289: LGTM!


Walkthrough

In FairQueueSelectionStrategy#weightedShuffle, totalWeight is changed from const to let, and the loop now subtracts the selected item's weight from totalWeight after each draw. This ensures each subsequent weighted random pick is scaled to the remaining items rather than the original full sum. A new Redis-backed integration test verifies that four equal-weight environments appear with uniform frequency at every result position across 2,000 iterations, with a ±40% tolerance on the expected 1/N share.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main fix: decrementing totalWeight in the fair-queue weighted environment shuffle algorithm.
Description check ✅ Passed The description provides clear root cause analysis, the fix explanation, and mentions a regression test, but the checklist sections are not completed.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/fair-queue-weighted-shuffle-total-weight

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

Open in Devin Review

@matt-aitken matt-aitken merged commit 5667461 into main Jun 22, 2026
47 checks passed
@matt-aitken matt-aitken deleted the fix/fair-queue-weighted-shuffle-total-weight branch June 22, 2026 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants