Add Azure Container Instances backend + CTFd 3.8 compatibility #5
Open
therealcybermattlee wants to merge 23 commits into
Containers are now tracked per-user instead of per-team: ContainerInfoModel swaps team_id for user_id with an FK to users.id, the four player routes drop the team membership check and pass user.id, and the admin dashboard relabels the column accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces a second container backend selectable from the admin Settings page. Choosing "Azure Container Instances" provisions a container group per challenge attempt via azure-mgmt-containerinstance, authenticates to Azure with DefaultAzureCredential, and attaches a user-assigned managed identity to each spawned group so it can pull private images from ACR without storing registry credentials. Schema gains a nullable `hostname` column on ContainerInfoModel so each ACI container's per-group DNS name can be returned to the player; the Docker backend still falls back to the global docker_hostname setting. Settings page is restructured around a backend selector with show/hide between Docker and Azure fieldsets. Also includes a handful of defensive fixes the design review flagged: the manager exposes a shutdown() method that the settings route now calls before swapping backends (was leaking BackgroundScheduler instances on every save); the ACI poller has an explicit 300s timeout and 5s polling interval; all previously silent except blocks log to stdout; the expired-container sweep commits once per pass instead of per row; and the admin kill route no longer crashes when the DB row has already been removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The synchronous request flow couldn't survive a 30-60 second ACI provision through CTFd's stock Gunicorn (30s default timeout) and nginx (60s default proxy_read_timeout), let alone Cloudflare's 100s edge limit. POST /containers/api/request now writes a placeholder row with status="provisioning", spawns a daemon thread to call the backend, and returns HTTP 202 with the row id. A new GET /api/status/<id> lets the frontend poll every 3 seconds (up to 3 minutes) until the row flips to running or failed. ContainerInfoModel switches to an auto-increment id primary key so the row exists before the backend has assigned a container_id; container_id, port, and hostname are nullable until provisioning completes; status and error_message are new columns. A unique constraint on (challenge_id, user_id) closes the TOCTOU window — concurrent requests catch IntegrityError and return the winning row. If a user stops or resets while provisioning is still in flight, the worker re-fetches the row after create succeeds; finding it gone, it kills the just-created container so we don't leak (and pay for) an orphan ACI group. Admin /api/kill, /api/purge, /api/stop, and the admin dashboard template now handle rows where container_id is still None (because the backend hasn't been called yet), and the dashboard adds a Status column showing provisioning / running / failed. The view.js request and reset handlers are rewritten in fetch with a polling loop; renew and stop stay XHR since they're fast operations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
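The unique-constraint race handling described above can be sketched in isolation. This is a minimal illustration using the standard-library `sqlite3` module rather than CTFd's SQLAlchemy models; the table name `container_info` and the helper `claim_row` are hypothetical stand-ins, not the plugin's actual code.

```python
import sqlite3


def claim_row(conn, challenge_id, user_id):
    """Insert a 'provisioning' placeholder, or return the row that won the race.

    Returns (row_id, created). Hypothetical helper for illustration only.
    """
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO container_info (challenge_id, user_id, status) "
                "VALUES (?, ?, 'provisioning')",
                (challenge_id, user_id),
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Another request already inserted the (challenge_id, user_id) pair.
        row = conn.execute(
            "SELECT id FROM container_info WHERE challenge_id = ? AND user_id = ?",
            (challenge_id, user_id),
        ).fetchone()
        return row[0], False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE container_info ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " challenge_id INTEGER NOT NULL,"
    " user_id INTEGER NOT NULL,"
    " container_id TEXT,"          # nullable until provisioning completes
    " status TEXT NOT NULL,"
    " UNIQUE (challenge_id, user_id))"
)

first = claim_row(conn, 7, 42)   # inserts the placeholder
second = claim_row(conn, 7, 42)  # hits the constraint, gets the same row back
```

Letting the database reject the duplicate avoids a separate "does a row exist?" check, which is where the TOCTOU window would otherwise sit.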
Clamp container memory to [0.1, 16] GB and CPU to [0.1, 4.0] cores so
an admin can't submit a spec ACI will reject at provision time. Cap
get_images at 20 tags per repo and order by last-updated descending,
so a large ACR doesn't hang the create-challenge dropdown while it
enumerates every tag. Pass UserAssignedIdentities() rather than {}
to ContainerGroupIdentity for SDK type-correctness; the wire payload
is identical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks through clone, custom Dockerfile that pip-installs the Azure SDK, compose.yaml wiring with service-principal env vars, Caddy as the TLS front-end, schema-drop instructions for upgraders, and an end-to-end verification step. Matches the common self-hosted layout where compose.yaml, CTFd/, and caddy/ sit side-by-side under a single parent directory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the compose.yaml example to mirror the typical ~/ctfd-stack configuration: ctfd/ctfd:3.8.4 image swap to build:, env vars carried through (REVERSE_PROXY, WORKERS, DATABASE_URL pattern), .ctfd_secret_key file mount with a pre-create step so Docker doesn't create it as a directory. Calls out the ctfd_frontend + ctfd_backend dual-attachment required for Azure egress and DB access. Notes Caddy with Cloudflare DNS-01 (caddy-cloudflare image) needs no proxy-timeout changes thanks to the async refactor. Adds a managed-identity vs service-principal split for Azure auth, with shell commands for assigning a system-assigned MI to the CTFd VM and granting Contributor + AcrPull on the appropriate scopes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The badge always read "Docker Connected/Not Connected" regardless of backend. When ACI is the active backend, label it accordingly so admins can tell at a glance which provisioner is wired up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
State-only PATCHes (e.g. Finish -> set visible) re-run calculate_value on the existing challenge. If decay is 0, the curve math hits a divide by zero and returns 500, leaving the Options modal stuck open. Treat decay=0 as a static challenge worth `initial` instead of throwing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
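A minimal sketch of the guard, assuming the curve follows CTFd's stock dynamic-value formula (an assumption; the plugin's exact math is not shown in this PR). The function name `dynamic_value` is hypothetical.

```python
import math


def dynamic_value(initial, minimum, decay, solve_count):
    """Return the challenge value; decay == 0 is treated as a static challenge."""
    if not decay:
        # Without this guard, the division by decay ** 2 below raises
        # ZeroDivisionError.
        return initial
    # Assumed curve, modeled on CTFd's stock dynamic scoring.
    value = ((minimum - initial) / (decay ** 2)) * (solve_count ** 2) + initial
    return max(math.ceil(value), minimum)
```

With `decay=0` the function returns `initial` regardless of solve count, which is the behavior this commit describes.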
`CTFd.lib.markdown()` is undefined at script-parse time on CTFd 3.8.x, so reading it at module top-level threw a TypeError that aborted the rest of view.js. Without preRender/render/postRender registered, the challenge modal failed to open. Initialize the renderer lazily inside preRender (and on first render() as a fallback). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Append ?v=<sha256-prefix> to create.js/update.js/view.js URLs so browsers fetch a fresh copy whenever the file content changes. The hash is computed once at module load, so a container restart picks up new asset content automatically. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CTFd 3.8.x ships with CTFd.lib exposing only {$, dayjs}; the markdown
helper isn't there. Detect both legacy (CTFd.lib.markdown() returns a
renderer) and newer (CTFd.lib.markdown is the renderer) shapes, and
fall back to escaped plain text if neither exists so the modal still
opens.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CTFd 3.8.x renders the challenge modal server-side using Alpine.js, and
the {% block connection_info %} override in view.html no longer applies
(block name/structure changed in newer CTFd). Inject the Request
Connection Info button (and reset/stop/renew controls) via JS in
postRender by anchoring off the stable .challenge-desc element. This
removes our dependency on CTFd's internal template structure.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ACI rejects memory_in_gb that isn't a multiple of 0.1 (e.g. 1500 MB / 1024 = 1.46484375 GB returned MemoryRequirementNotTimesOfOneTenthGB). Round memory to the nearest 0.1 GB and cpu to the nearest 0.01 cores so user-provided MB/CPU settings always satisfy Azure's precision. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
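The rounding, combined with the earlier [0.1, 16] GB and [0.1, 4.0] core clamp, might look like the sketch below. The function name `to_aci_resources` is hypothetical; only the limits and precisions come from the commit messages.

```python
def to_aci_resources(memory_mb, cpu_cores):
    """Convert admin-supplied MB / cores into values at ACI's accepted precision."""
    memory_gb = round(memory_mb / 1024, 1)  # nearest 0.1 GB
    cpu = round(cpu_cores, 2)               # nearest 0.01 cores
    # Clamp after rounding so tiny inputs cannot round down to 0.
    memory_gb = min(max(memory_gb, 0.1), 16.0)
    cpu = min(max(cpu, 0.1), 4.0)
    return memory_gb, cpu
```

For the failing case in the commit, `to_aci_resources(1500, 1.0)` turns 1.46484375 GB into 1.5 GB.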
Previously, closing and reopening the challenge modal would always show the Request Connection Info button, and clicking it would spawn a NEW container even if the user already had one running for that challenge. Add a GET /api/running/<chal_id> endpoint that returns the user's existing container state (running, provisioning, or none) without spawning anything, and have view.js call it on modal open so the UI reflects the actual backend state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related improvements: 1) When a player submits a correct flag, kill their ACI container and delete the ContainerInfoModel row. The container has served its purpose; keeping it running costs money and leaks compute. 2) When provisioning, derive the ACI container-group name from the user's name instead of pure randomness. URLs go from beyondctf-41769317.<region>.azurecontainer.io to beyondctf-claude-a3b8.<region>.azurecontainer.io — easier to debug in Azure portal, and players see their handle. A 4-char random suffix is retained so per-user-per-challenge resets don't collide on the brief window where the old group is being deleted. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
route_containers_dashboard referenced a bare `settings` name that doesn't exist in its scope, causing every visit to the admin containers dashboard to 500. Use container_manager.settings, which is the live dict already available in the closure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous regex sanitizer left odd output for usernames with
accents (héllo -> h-llo), pure-symbol names ('!!!' -> empty), or
non-ASCII scripts (日本語 -> empty -> invalid Azure name). Replace
it with a small NFKD-based slugifier that strips diacritics,
collapses non-alphanumerics into single hyphens, caps length, and
falls back to 'user' for anything that slugs to empty. Validated
against 14 edge cases (accents, mixed case, oversized, unicode-only,
all-symbol, leading/trailing punctuation, etc.).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
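A sketch of what an NFKD-based slugifier along these lines could look like. The names `slugify_username` and `group_name`, the 20-character cap, and the hex alphabet for the suffix are assumptions for illustration; the commit only states the behavior.

```python
import re
import secrets
import unicodedata


def slugify_username(name, max_len=20):
    """Reduce a username to lowercase ASCII letters, digits, and single hyphens."""
    # NFKD splits accented characters into base letter + combining mark;
    # the ASCII encode step then drops the marks and any non-ASCII script.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")
    slug = slug[:max_len].strip("-")  # truncation may leave a trailing hyphen
    return slug or "user"


def group_name(prefix, username):
    """Build a container-group name with a 4-character random suffix."""
    suffix = secrets.token_hex(2)  # 2 random bytes -> 4 hex characters
    return f"{prefix}-{slugify_username(username)}-{suffix}"
```

The three problem inputs from the commit message come out as `hello`, `user`, and `user` respectively.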
Container challenges previously used a single global CPU/RAM setting (default 1 vCPU / 1.5 GB), so intensive challenges like an LLM couldn't get more without bumping every container. Add a per-challenge Size tier (small/medium/large) selectable in the admin create/edit form. Each tier maps to fixed CPU/memory values that flow through the spawn path into both the ACI and Docker backends; the global setting remains the fallback for the unset path. "small" equals the historical default, so existing challenges are unchanged. A guarded ALTER TABLE in load() adds the new `size` column on upgrade (create_all never alters existing tables) and backfills rows to 'small'. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
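The guarded column add can be illustrated with SQLite from the standard library. This is a sketch under assumptions: the plugin runs against CTFd's own database through SQLAlchemy, and the table name `container_challenge` and helper `ensure_size_column` are hypothetical.

```python
import sqlite3


def ensure_size_column(conn, table="container_challenge"):
    """Add a `size` column if it is missing and backfill existing rows to 'small'.

    Returns True if the column was added, False if it already existed.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if "size" in columns:
        return False
    with conn:  # commit the ALTER and the backfill together
        conn.execute(f"ALTER TABLE {table} ADD COLUMN size TEXT")
        conn.execute(f"UPDATE {table} SET size = 'small' WHERE size IS NULL")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE container_challenge (id INTEGER PRIMARY KEY, image TEXT)")
conn.execute("INSERT INTO container_challenge (image) VALUES ('nginx:alpine')")

added_first = ensure_size_column(conn)   # column missing: added and backfilled
added_again = ensure_size_column(conn)   # column present: no-op
sizes = [row[0] for row in conn.execute("SELECT size FROM container_challenge")]
```

Checking for the column first makes the upgrade step safe to run on every plugin load.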
The Image <select> was rendered with `disabled` and only re-enabled on the
success branch of the /containers/api/images XHR. If that endpoint errored,
returned non-JSON (e.g. an HTML 5xx page), or the network call failed,
update.js threw on JSON.parse before reaching removeAttribute("disabled").
Disabled <select> elements don't submit, so editing any field on an
existing challenge silently sent an empty `image` and broke the save.
Render the saved image as a pre-selected <option> directly in update.html
and drop the `disabled` attribute, so the form is submittable from page
load regardless of what the registry endpoint does. Update.js now appends
additional registry tags when the XHR succeeds (deduping against the
pre-rendered option) and surfaces a useful status message on failure
instead of leaving the field stuck on "Loading...". Create.js gets the
same error-surfacing treatment so authors can see why the dropdown isn't
populating.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CTFd's challenge-type loader URL-encodes the values it reads from the ContainerChallenge.scripts dict before requesting them, so a value of "/plugins/containers/assets/update.js?v=Da4df62fd" came back to the server as "/plugins/containers/assets/update.js%3Fv%3DDa4df62fd". The `?` (`%3F`) and `=` (`%3D`) became part of the path, Flask's static route returned 404, the JS never ran, and the Image dropdown sat permanently on its "Loading..." placeholder — which also disabled edits to every other field on the challenge form. Drop the _versioned() helper and use plain paths. Admins hard-refresh after a plugin update if their browser cached a stale copy, which they already need to do for HTML/template changes anyway. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
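The effect of percent-encoding a versioned path can be reproduced with Python's `urllib.parse.quote`. This only demonstrates the encoding result; CTFd's loader does its own encoding on the client side, so the call below is a stand-in, not the code path involved.

```python
from urllib.parse import quote

versioned = "/plugins/containers/assets/update.js?v=Da4df62fd"
plain = "/plugins/containers/assets/update.js"

# quote() keeps "/" by default but escapes "?" and "=", so the query string
# is folded into the path.
encoded_versioned = quote(versioned)
encoded_plain = quote(plain)
```

The plain path survives encoding unchanged, which is why dropping the version suffix fixes the 404.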
CTFd 3.8.x replaced per-plugin scoring fields with a unified Scoring Function selector (Static / Linear / Logarithmic). On Static, the new admin form submits `value` directly and does not expose a way to edit `initial` — the legacy column we relied on as the "static value" sits unchanged at whatever it was at challenge creation. Our calculate_value() ran on every update and solve, and when decay was 0 (Static) it overwrote challenge.value with challenge.initial. Result: admins changed Current Value from 150 to 200, the form sent value=200, the PATCH response came back value=150 — the new value was wiped out in the same request that saved it. The challenge tile on the player board showed the stale 150 too, of course. Drop the overwrite. For Static (decay falsy) just return the challenge as-is so whatever the form set persists. Dynamic scoring (decay > 0) keeps the existing logarithmic recalculation against solve count. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous commit removed the value=initial clobber in the static branch of calculate_value(), but it also dropped the only db.session.commit() on that path. update() relies on calculate_value() to commit, so the setattr() changes lived in the session only until the request ended and got rolled back. PATCH response showed the new value (in-memory challenge object) but a reload of the form showed the pre-edit value because nothing had actually hit the DB. Add the commit back on the Static early-return. Dynamic scoring path still commits at its existing line, so nothing else changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CTFd 3.8.x's base admin/challenges/{create,update}.html provides its
own unified Scoring Function selector (Static / Linear / Logarithmic)
and reveals the matching Initial / Decay / Minimum inputs when the
admin picks a non-Static function. Our templates were rendering their
own copies too, which CTFd's JS hid via display:none and stripped of
their `name` attribute so they never submitted — pure DOM noise that
showed up as duplicated `.chal-initial` / `.chal-decay` / `.chal-minimum`
elements in DevTools and confused diagnostics.
Remove the three form-groups from both files. Current Value on the
update form stays: it's the only `name="value"` input on the page
and the only one that actually submits, since CTFd's base template
delegates that field to the plugin via the `{% block value %}` override.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TL;DR
This PR adds an Azure Container Instances (ACI) backend to the plugin so
operators on Azure can run challenge containers as managed cloud workloads
instead of relying on a single Docker host. It also brings the plugin
forward to CTFd 3.8.x (the existing code targets the older Jinja+jQuery
challenge modal and breaks on the new Alpine.js one), and rolls in several
small fixes that surfaced while shipping a live CTF.
The Docker path is unchanged for existing users; backend selection is a
new setting that defaults to `docker`.

What's in here
1. Azure Container Instances backend (`container_manager_aci.py`)
- `ACIContainerManager` mirrors the existing `ContainerManager` API
  (`create_container`, `is_container_running`, `kill_container`, …).
- Authenticates with `DefaultAzureCredential` from inside the CTFd container,
  which works with the host VM's system-assigned managed identity, with no
  static secrets.
- A user-assigned managed identity (UAMI) is referenced from the container
  group, so the CTFd VM identity only needs `Contributor` on the RG and
  `Managed Identity Operator` on the UAMI; the UAMI gets `AcrPull` on the ACR.
  No registry passwords in CTFd settings.
- Image list via `azure-containerregistry` (`ContainerRegistryClient`).
- Per-user DNS labels (unicode-aware slugify, 4-char random suffix for
  collision safety), e.g. `beyondctf-claude-a3b8.westus2.azurecontainer.io`.
- CPU and memory are rounded to Azure's precision (0.1 GB memory, 0.01 cpu);
  Azure rejects e.g. `1500/1024`.

Backend selection lives in plugin settings (`backend = docker|aci`),
plus a new Azure-specific settings block (resource group, region, UAMI
resource ID, ACR login server, DNS label prefix).
2. Async provisioning + polling
ACI cold-starts take 20–60s, which blocks the request thread on a
synchronous create. Moved provisioning into a background thread that
writes status (`provisioning`/`running`/`failed`) onto `ContainerInfoModel`, and added
`GET /containers/api/status/<row_id>` for the frontend to poll. The Docker path benefits too: the user
isn't held on a hanging request if their host is slow.
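The request/poll flow can be sketched without Flask or a database. Everything below is a stand-in: `rows` replaces `ContainerInfoModel`, `fake_backend_create` replaces the real backend call, and the function names are hypothetical.

```python
import threading
import time

rows = {}                      # stand-in for ContainerInfoModel rows, keyed by row id
rows_lock = threading.Lock()


def fake_backend_create(image):
    """Pretend to provision a container; a real backend call takes far longer."""
    time.sleep(0.05)
    return {"container_id": "abc123", "hostname": "example.invalid", "port": 22}


def provision(row_id, image, create):
    """Worker body: call the backend, then record the outcome on the row."""
    try:
        result = create(image)
    except Exception as exc:
        with rows_lock:
            if row_id in rows:
                rows[row_id].update(status="failed", error_message=str(exc))
        return
    with rows_lock:
        row = rows.get(row_id)
        if row is None:
            # Row was deleted mid-flight; real code would kill the new container.
            return
        row.update(status="running", **result)


def request_container(row_id, image, create=fake_backend_create):
    """Write the placeholder row and start the worker; returns immediately."""
    with rows_lock:
        rows[row_id] = {"status": "provisioning", "container_id": None}
    worker = threading.Thread(target=provision, args=(row_id, image, create), daemon=True)
    worker.start()
    return worker  # the real route would respond with HTTP 202 and the row id


def get_status(row_id):
    """What a status endpoint would report for this row."""
    with rows_lock:
        row = rows.get(row_id)
        return row["status"] if row else "none"


worker = request_container(1, "nginx:alpine")
worker.join()  # a real client would poll get_status() instead of joining
final_status = get_status(1)
```

The request path only writes a row and starts a thread, so its latency no longer depends on how long the backend takes.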
3. CTFd 3.8 compatibility
CTFd 3.8 swapped the challenge modal from Jinja+jQuery to a hard-coded
Alpine.js template that doesn't extend plugin templates. This PR makes
the plugin work on 3.8 without breaking older 3.x:
- `view.js` no longer dereferences `CTFd.lib.markdown` at load time
  (it's undefined on 3.8). The renderer is initialized lazily inside
  `preRender`, with fallbacks for both the legacy callable shape and
  the newer object shape, plus a plain-text-escape fallback if
  neither is present.
- The connection UI (Request Connection Info button, reset/stop/renew
  controls, and result panel) is injected from JS in `postRender`,
  anchored off the stable `.challenge-desc` element, instead of
  relying on `{% block connection_info %}` in `view.html` (the block
  hook no longer fires).
- `solve()` no longer 500s on state-only PATCH when `decay` is 0
  (caused `ZeroDivisionError` in `calculate_value`).
- Asset URLs are versioned (`?v=<sha256-prefix>`) so browsers pick up
  new JS/CSS without manual cache busting.
4. Per-challenge state restore on modal reopen
Closing and reopening a challenge modal used to always show the
Request button, and clicking it would spawn a new container even
when the player already had one running. Added
`GET /containers/api/running/<chal_id>`, which returns the player's current container state (running / provisioning / none) without
spawning anything, and the view JS uses it to render the right UI on
modal open.
5. Tear down container on solve
When a player submits a correct flag for a container challenge, the
plugin now kills their ACI/Docker container and clears the row in
`solve()`. Saves on idle compute and stops the now-pointless URL from sitting open.
6. Smaller fixes
- The admin status badge names the active backend
  ("Docker Connected" vs "Azure Container Instances Connected").
- `route_containers_dashboard` no longer references an undefined
  `settings` name (was a 500).
- The deployment walkthrough follows the compose layout most operators
  end up with (Caddy + MariaDB + Redis + bind-mounted plugin).
Heads-up: per-user mode
Earlier in this branch, `ContainerInfoModel` was switched from `team_id`
to `user_id` so per-player ACI containers were addressable in user-mode
CTFs (which is what the project I was running used). If you'd prefer
to keep team-mode as the default and add user-mode as a setting, I'm
happy to refactor that. It's an isolated change touching
`ContainerInfoModel`, the four player routes, and one dashboard column.

Testing
End-to-end verified against a live CTFd 3.8.4 deployment on Azure
(CTFd in Docker on an Azure VM, ACI in `westus2`, ACR with UAMI auth):
- `whoami` SSH challenge → ACI container running with per-user DNS hostname.
- `view-source` web challenge (`nginx:alpine` + static HTML) → reachable, flag in source.
- Username slugifier validated against 14 edge cases (accents, unicode,
  all-symbol, oversized, mixed case, leading/trailing punctuation).
Settings additions (admin → containers → Settings)
| Setting | Value / example |
|---|---|
| `backend` | `docker` (default) or `aci` |
| `azure_resource_group` | |
| `azure_region` | `westus2` |
| `azure_uami_resource_id` | |
| `acr_login_server` | `<registry>.azurecr.io` |
| `azure_dns_label_prefix` | |

A new `requirements.txt` adds the Azure SDK packages. These are only imported
lazily when `backend == "aci"`, so they're optional dependencies for
Docker-only deployments.
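One way to keep the Azure SDK optional is to defer the imports until the ACI backend is actually selected. The sketch below is an assumption about the shape of that logic, not the plugin's code; `load_backend_modules` is a hypothetical helper, and the module list is limited to packages this PR mentions.

```python
import importlib

# Import paths for azure-identity and azure-mgmt-containerinstance.
AZURE_MODULES = ("azure.identity", "azure.mgmt.containerinstance")


def load_backend_modules(backend):
    """Import the Azure SDK only when the ACI backend is selected."""
    if backend != "aci":
        return {}  # Docker-only deployments never touch the Azure packages
    loaded = {}
    for name in AZURE_MODULES:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as exc:
            raise RuntimeError(f"backend 'aci' needs the Azure SDK: {exc}") from exc
    return loaded
```

Calling `load_backend_modules("docker")` returns an empty dict without importing anything, so a missing Azure SDK only surfaces when an admin switches to `aci`.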
Happy to split this into separate PRs (ACI backend / 3.8 compat /
fixes) if you'd prefer that for review.