LPX-649: extract queue & scheduler into their own ECS services #73

Open
stevethomas wants to merge 2 commits into main from steve/distracted-boyd-d8122f

@stevethomas
Member

Hey, I made a thing! 🥳

LPX-649 — extract queue & scheduler into their own ECS services.

What problems are you solving?

YOLO bundles web + queue + scheduler into one Fargate task, coupling three workloads with different scaling shapes onto a single desiredCount. This PR makes each workload its own service so it can scale independently:

  • web — target tracking (unchanged).
  • queue — its own service, scale-to-zero by default. Backlog-per-task target tracking (ApproximateNumberOfMessagesVisible / RunningTaskCount via CloudWatch metric math, no Lambda), plus a step-scaling alarm that lifts it 0→1 the instant a message lands (target tracking can't divide by zero running tasks). Opt-in Fargate Spot (~70% cheaper). Costs ~$0 idle.
  • scheduler — its own pinned-singleton service (min=max=1, never a scalable target), deployed stop-then-start so a rollout never briefly runs two crons. Drops the ->onOneServer() requirement.
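
For reviewers, a rough sketch of the queue scaling arithmetic described above, written in Python purely for illustration. The function names, the target of 10 messages per task, and the cap of 5 tasks are assumptions rather than values from this PR. The real behaviour comes from CloudWatch metric math and the scaling policies, not from application code.

```python
import math


def backlog_per_task(visible_messages: int, running_tasks: int):
    """Backlog-per-task metric: visible messages divided by running tasks.

    Returns None when no tasks are running, since the ratio is undefined
    at zero and target tracking has nothing to act on.
    """
    if running_tasks == 0:
        return None
    return visible_messages / running_tasks


def desired_tasks(visible_messages: int, running_tasks: int,
                  target_backlog: int = 10, max_tasks: int = 5) -> int:
    """Pick a task count. target_backlog and max_tasks are hypothetical."""
    if visible_messages == 0:
        return 0  # idle queue: scale to zero
    if backlog_per_task(visible_messages, running_tasks) is None:
        return 1  # step-scaling path: lift 0 -> 1 when a message lands
    # Target-tracking path: enough tasks to bring backlog per task to target.
    return min(max_tasks, max(1, math.ceil(visible_messages / target_backlog)))
```

The `None` branch is the reason the separate step-scaling alarm exists: with zero running tasks there is no ratio for target tracking to follow.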

Topology is encoded by location, not a flag (your call on the design):

| Manifest | Means |
| --- | --- |
| tasks.web.queue / tasks.web.scheduler | bundled in the web container: warm, instant pickup. Unchanged. |
| top-level tasks.queue / tasks.scheduler | extracted into its own service with the grown-up config. |
| both, for one workload | hard error, pick one. |

So tasks.web.queue = "a chore the web box also does"; tasks.queue = "a workload that stands on its own." Nothing breaks — existing manifests are untouched, extraction is additive opt-in.
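
A minimal sketch of the location-based rule, assuming the manifest has already been parsed into a nested dict. `resolve_topology` and its return strings are hypothetical names for illustration and are not the PR's actual PHP implementation.

```python
def resolve_topology(manifest: dict, workload: str) -> str:
    """Classify a workload ("queue" or "scheduler") by where it is declared."""
    tasks = manifest.get("tasks") or {}
    web = tasks.get("web") or {}

    bundled = workload in web      # tasks.web.<workload>
    extracted = workload in tasks  # top-level tasks.<workload>

    if bundled and extracted:
        # Configuring one workload both ways is a hard error.
        raise ValueError(
            f"{workload} is declared under tasks.web and at top level; pick one"
        )
    if extracted:
        return "extracted"
    if bundled:
        return "bundled"
    return "absent"
```

Key presence decides the outcome, so an empty top-level block such as `queue: {}` still counts as extracted.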

Also in this PR:

  • --group on deploy / run; scale --queue (min 0 = scale to zero); group-aware DeployerPolicy.
  • One image serves every role — the task-def passes the role as the container command and the entrypoint dispatches with a per-role graceful drain.
  • Retires the dead EC2-era RunsOnAws*Environment detectors + the unused ParsesOnlyOption concern (no Fargate implementors).
  • Hardened Manifest::put's surgical YAML writer to fall back to a full re-dump for an inline-empty-map parent (queue: {}) rather than corrupting it.
  • Docs: scaling guide, manifest + commands reference, yolo.yml / Dockerfile stubs.
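
To illustrate the one-image, role-as-command idea, here is a rough Python sketch of an entrypoint that picks what to run from its first argument. The role commands below are placeholders invented for this example, and the default-to-web behaviour is an assumption. The real entrypoint, its commands, and its per-role drain handling may look quite different.

```python
# Hypothetical role -> command table; the command names are placeholders.
ROLE_COMMANDS = {
    "web": ["serve-web"],
    "queue": ["run-queue-worker"],
    "scheduler": ["run-scheduler"],
}


def command_for(argv: list) -> list:
    """Return the command for the role passed as the container command.

    argv mirrors sys.argv: argv[0] is the entrypoint itself and argv[1],
    if present, is the role. Defaulting to "web" is an assumption.
    """
    role = argv[1] if len(argv) > 1 else "web"
    if role not in ROLE_COMMANDS:
        raise SystemExit(f"unknown role: {role}")
    return ROLE_COMMANDS[role]
```

A real entrypoint would then exec the returned command and handle SIGTERM according to the role, so that each role can drain gracefully.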

Is there anything the reviewer needs to know to deploy this?

  • No infra was touched and nothing was merged — this is code + docs only.
  • No breaking manifest change. Bundled tasks.web.queue/scheduler keep working exactly as before; CL's yolo.yml needs no changes. Extraction is opt-in.
  • Multi-tenancy: a standalone queue is one service per app on the default/landlord queue; per-tenant queue fan-out is out of scope and composes with LPX-601.
  • The web ALB health-gate is unchanged; a --group-scoped deploy that omits web skips that wait and relies on the ECS circuit breaker for the headless services.
  • 553 Pest pass · PHPStan clean · Pint clean · VitePress docs build clean. Rebased onto latest main (incl. feat(sync): heartbeat + realistic timeout for slow AWS waiters #72 and the IAM-policy-drift / elasticache deployer changes).

🤖 Generated with Claude Code

stevethomas and others added 2 commits June 2, 2026 18:18
Promote the bundled web+queue+scheduler task into three independent,
group-aware ECS services so each workload scales on its own shape:

- web: target tracking (unchanged)
- queue: standalone service, scale-to-zero by default — backlog-per-task
  target tracking (MessagesVisible / RunningTaskCount metric math) plus a
  step-scaling alarm to lift it 0->1; opt-in Fargate Spot
- scheduler: pinned-singleton service (min=max=1), deployed stop-then-start
  so a rollout never runs two crons (drops the onOneServer() requirement)

Topology is encoded by location: bundled via tasks.web.queue/scheduler
(warm, instant pickup — unchanged), extracted via top-level tasks.queue /
tasks.scheduler. Configuring a workload both ways hard-fails. One image
serves every role; the task definition passes the role as the container
command and the entrypoint dispatches with a per-role graceful drain.

Also: --group on deploy/run, scale --queue, a group-aware DeployerPolicy,
and retires the dead EC2-era RunsOnAws*Environment detectors + the unused
ParsesOnlyOption concern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Resolve conflicts from main's audit refactor + DynamoDB removal landing
ahead of this PR:

- ScaleCommand: keep the queue-as-its-own-service path (resolveGroup
  returns ServerGroup; docblock examples) over main's 'queue not yet'
  placeholders — this PR is what makes queue scaling real.
- SyncAppCommand + advisory test: keep the fuller scheduler advisory
  that points at the new top-level 'tasks.scheduler' block (main only
  trimmed it because that feature did not exist there yet).
- docs/guide/scaling.md: take main's lock-store wording (drops the
  removed DynamoDB, names Valkey/Redis); keep this PR's tasks.scheduler
  extraction section.
- docs/reference/commands.md: keep the standalone --queue row; drop
  main's duplicate --min/--max row.

Absorbs main's DynamoDB removal (sessions on Valkey). pint, phpstan,
546 pest, and the VitePress build all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>