Introduce dedicated stage EFS; fix MQ broker drift and Memcached SG #378

Open
e9e4e5f0faef wants to merge 3 commits into stage from feat/stage-efs-isolation

@e9e4e5f0faef e9e4e5f0faef commented Apr 13, 2026

Summary

  • Replace the previous shared-EFS approach with a dedicated stage-owned EFS filesystem
  • Add an EFS access point (/addons, POSIX 9500:9500) so the eventual NETAPP_STORAGE_ROOT flip does not fail on filesystem permissions for the non-root olympia runtime user
  • Fix perpetual Amazon MQ broker replacement caused by engineType casing mismatch
  • Correct Memcached ingress rule placement to the security group actually used by the cluster
  • Fix pre-flight validator to recognise Amazon MQ .on.aws domain endpoints
  • Remove obsolete SG rules from the superseded storage and broker designs

Changes

| File | Change |
| --- | --- |
| infra/pulumi/__main__.py | Add dedicated stage EFS filesystem, mount targets, and access point; inject access-point authorisation into web/worker/cron volume configs (scoped to the addons-efs volume); correct Memcached ingress rule placement; fix engine_type casing to prevent force-new broker replacement; remove obsolete SG rules |
| infra/pulumi/config.stage.yaml | Replace shared EFS mount-target configuration with dedicated stage EFS configuration |
| infra/scripts/preflight_check.py | Recognise .mq.<region>.on.aws endpoints in broker isolation and SG reachability checks |

Why

Dedicated stage EFS: the previous approach tried to mount a shared filesystem across VPC boundaries and failed with MountTargetConflict. A dedicated stage filesystem keeps storage aligned with the stage isolation model and avoids the cross-VPC limitation.

EFS access point for olympia UID/GID 9500: the application runs as olympia (UID 9500) per Dockerfile.ecs. Without an access point, the eventual NETAPP_STORAGE_ROOT flip from /tmp/storage to /var/addons would fail with EACCES because an empty EFS root directory is owned by root:root with 0755 permissions. The access point at /addons with POSIX 9500:9500 exposes a writable subtree without requiring a root-task bootstrap ritual at activation time. Containers still mount it at /var/addons via mountPoints. Injection is scoped to the volume named addons-efs and applies to web/worker (YAML-defined) and cron (Python-constructed).
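The scoped injection described above could look roughly like the sketch below. It is not the code in infra/pulumi/__main__.py: the function name inject_access_point and the plain-dict volume shape (ECS task-definition JSON keys) are assumptions for illustration, and the access point ID is treated as a plain string. In a real Pulumi program the ID would be an Output, so this logic would presumably run inside an apply callback.

```python
import copy

# Volume name taken from the PR description; everything else is hypothetical.
EFS_VOLUME_NAME = "addons-efs"


def inject_access_point(volumes, access_point_id):
    """Return a copy of `volumes` with access-point auth on the addons-efs volume only."""
    result = copy.deepcopy(volumes)  # never mutate the caller's config
    for volume in result:
        if volume.get("name") != EFS_VOLUME_NAME:
            continue  # leave every other volume untouched
        efs = volume.get("efsVolumeConfiguration")
        if efs is None:
            continue  # not an EFS volume; nothing to authorise
        # ECS requires transit encryption when an access point is used.
        efs["transitEncryption"] = "ENABLED"
        # With an access point, rootDirectory must be omitted or "/";
        # the access point itself supplies the /addons subtree.
        efs.pop("rootDirectory", None)
        efs.setdefault("authorizationConfig", {})["accessPointId"] = access_point_id
    return result
```

Because the function works on a deep copy and matches on the volume name, the same helper could be applied to the YAML-defined web/worker volumes and the Python-constructed cron volumes without touching unrelated volumes.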

MQ broker drift fix: AWS returns engineType as RabbitMQ, while the code previously used RABBITMQ. Because this is a force-new field, the mismatch caused a perpetual broker replacement diff. The fix aligns the configured value with the value returned by AWS.
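One way to guard against this class of drift is to normalise the configured value to the casing AWS reports before it reaches the broker resource. The sketch below is illustrative only: the helper names are hypothetical, and the mapping contains only the RabbitMQ value confirmed in this PR; other engines would need their reported casing verified before being added.

```python
# Canonical casing as reported back by AWS, per this PR description.
# Only RabbitMQ is confirmed here; other entries are deliberately left out.
CANONICAL_ENGINE_TYPES = {"rabbitmq": "RabbitMQ"}


def canonical_engine_type(value):
    """Map any casing of a known engine type to the casing AWS returns."""
    try:
        return CANONICAL_ENGINE_TYPES[value.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown engine type: {value!r}") from None


def has_casing_drift(configured, reported):
    """True when two values differ only by letter case (the bug fixed here)."""
    return configured != reported and configured.lower() == reported.lower()
```

For example, canonical_engine_type("RABBITMQ") yields "RabbitMQ", and has_casing_drift("RABBITMQ", "RabbitMQ") flags exactly the mismatch that produced the perpetual replacement diff.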

Memcached SG correction: the 11211 ingress rule was attached to the wrong security group, so it had no effect. This moves it to the correct SG and restores the intended cache connectivity.
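A check of the following shape would catch a misplaced rule like this one before deploy. It is a sketch under assumptions, not the validator in infra/scripts/preflight_check.py: the rule dictionaries and their field names (security_group_id, from_port, to_port) are hypothetical stand-ins for however the rules are actually represented.

```python
MEMCACHED_PORT = 11211  # port named in the PR description


def sgs_allowing_port(ingress_rules, port=MEMCACHED_PORT):
    """Return the IDs of security groups that have an ingress rule covering `port`."""
    return {
        rule["security_group_id"]
        for rule in ingress_rules
        if rule["from_port"] <= port <= rule["to_port"]
    }


def memcached_rule_is_effective(ingress_rules, cluster_sg_ids):
    """True only if some 11211 ingress rule sits on an SG the cluster actually uses."""
    return bool(sgs_allowing_port(ingress_rules) & set(cluster_sg_ids))
```

With the rule attached to an SG the cluster does not use, the intersection is empty and the check fails, which mirrors the "no effect" behaviour described above.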

Validator fix: Amazon MQ RabbitMQ endpoints use .mq.<region>.on.aws, which the validator did not previously recognise.
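A minimal sketch of hostname recognition covering the new suffix is shown below. The function name mq_region and the exact pattern are assumptions rather than the validator's real code; the pattern accepts the .mq.<region>.on.aws form from this PR alongside the older .mq.<region>.amazonaws.com form.

```python
import re
from urllib.parse import urlparse

# Assumed hostname shapes: <broker-id>.mq.<region>.on.aws (from this PR)
# and the older <broker-id>.mq.<region>.amazonaws.com form.
_MQ_HOST_RE = re.compile(
    r"^[a-z0-9-]+\.mq\."
    r"(?P<region>[a-z]{2}(?:-[a-z]+)+-\d+)"
    r"\.(?:on\.aws|amazonaws\.com)$"
)


def mq_region(endpoint):
    """Return the region if `endpoint` looks like an Amazon MQ host, else None."""
    if "://" in endpoint:
        host = urlparse(endpoint).hostname  # strips scheme, port, and path
    else:
        host = endpoint.split(":")[0]  # bare host, possibly with a port
    match = _MQ_HOST_RE.match((host or "").lower())
    return match.group("region") if match else None
```

Anchoring the pattern at both ends means a look-alike host that merely contains the suffix in the middle of a longer name is rejected.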

Storage activation model (staged)

This PR creates and mounts the dedicated stage EFS filesystem at /var/addons on web/worker/cron, with the access point in place so the runtime user can write into it. It does not switch application writes to EFS.

After this PR deploys:

  • EFS is mounted at /var/addons via the /addons access point
  • NETAPP_STORAGE_ROOT remains /tmp/storage
  • Application file writes therefore remain ephemeral until a deliberate later flip

The NETAPP_STORAGE_ROOT flip is intentionally a separate, future operational step. It is gated on post-deploy validation (mount verified, write/read/delete as olympia UID 9500 succeeds, persistence across task restart confirmed).
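The write/read/delete part of that gate could be exercised with a small probe such as the sketch below. It is hypothetical and not part of this PR; the function name and probe filename are made up for illustration, and it assumes a Unix runtime (os.getuid). Persistence across a task restart is not covered and would need a separate check.

```python
import os
import uuid


def check_storage_roundtrip(root):
    """Write, read back, and delete a probe file under `root`; return the UID used."""
    probe = os.path.join(root, f".efs-probe-{uuid.uuid4().hex}")
    payload = b"efs-roundtrip"
    # Write: fails with PermissionError (EACCES) if the runtime user cannot write.
    with open(probe, "wb") as fh:
        fh.write(payload)
    try:
        # Read back and compare, so a silent short write is caught.
        with open(probe, "rb") as fh:
            if fh.read() != payload:
                raise RuntimeError(f"read-back mismatch for {probe}")
    finally:
        # Delete: always clean up the probe, even if the read failed.
        os.remove(probe)
    if os.path.exists(probe):
        raise RuntimeError(f"probe still present after delete: {probe}")
    return os.getuid()
```

Run inside a web, worker, or cron task against /var/addons, the returned UID would be expected to be 9500 if the task is running as olympia.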

Validation

  • pulumi preview shows + 6 to create / ~ 20 to update / - 3 to delete / +- 3 to replace / = 140 unchanged. The creates are the expected dedicated EFS resources (filesystem, mount targets, access point) and the corrected Memcached ingress rule. The replaces are the task definitions, which pick up the new filesystem ID and access-point authorisation. No broker replacement appears in the diff.
  • ruff check and ruff format --check pass
  • Pre-flight validator passes with the .on.aws fix applied

Safety

  • Scheduled tasks remain disabled
  • The MQ change is drift prevention only; no functional broker migration is included here
  • The new EFS filesystem is created empty; no existing data is modified
  • Normal application writes do not touch EFS because NETAPP_STORAGE_ROOT still points at /tmp/storage
  • The removed SG rules belong to superseded designs and are not used by running services

Follow-up


Addresses part of #375, with issue closure to follow post-deploy validation.

@e9e4e5f0faef e9e4e5f0faef requested a review from Sancus April 13, 2026 17:58
@e9e4e5f0faef e9e4e5f0faef self-assigned this Apr 13, 2026
@e9e4e5f0faef e9e4e5f0faef changed the title from "Introduce dedicated stage EFS; fix .on.aws validator support" to "Introduce dedicated stage EFS; fix MQ broker drift and Memcached SG" Apr 15, 2026