Add Azure Container Instances backend + CTFd 3.8 compatibility #5

Open
therealcybermattlee wants to merge 23 commits into
andyjsmith:mainfrom
therealcybermattlee:main

@therealcybermattlee

TL;DR

This PR adds an Azure Container Instances (ACI) backend to the plugin so
operators on Azure can run challenge containers as managed cloud workloads
instead of relying on a single Docker host. It also brings the plugin
forward to CTFd 3.8.x (the existing code targets the older Jinja+jQuery
challenge modal and breaks on the new Alpine.js one), and rolls in several
small fixes that surfaced while shipping a live CTF.

The Docker path is unchanged for existing users — backend selection is a
new setting that defaults to docker.


What's in here

1. Azure Container Instances backend (container_manager_aci.py)

  • ACIContainerManager mirrors the existing ContainerManager API
    (create_container, is_container_running, kill_container, …).
  • Auth via DefaultAzureCredential from inside the CTFd container —
    works with the host VM's system-assigned managed identity, with no
    static secrets.
  • ACR image pulls use a user-assigned managed identity (UAMI)
    referenced from the container group, so the CTFd VM identity only
    needs Contributor on the RG and Managed Identity Operator on the
    UAMI; the UAMI gets AcrPull on the ACR. No registry passwords in
    CTFd settings.
  • Image dropdown in the admin form pulls the live ACR repository/tag
    list via azure-containerregistry (ContainerRegistryClient).
  • Per-container DNS labels are derived from the player's username
    (unicode-aware slugify, 4-char random suffix for collision safety),
    e.g. beyondctf-claude-a3b8.westus2.azurecontainer.io.
  • Memory/CPU requests are clamped and rounded to Azure's allowed
    precision (0.1 GB memory, 0.01 cpu); Azure rejects unrounded values
    such as 1500 MB / 1024 = 1.46484375 GB.

Backend selection lives in plugin settings (backend = docker|aci),
plus a new Azure-specific settings block (resource group, region, UAMI
resource ID, ACR login server, DNS label prefix).

2. Async provisioning + polling

ACI cold-starts take 20–60s, which blocks the request thread on a
synchronous create. Moved provisioning into a background thread that
writes status (provisioning / running / failed) onto
ContainerInfoModel, and added GET /containers/api/status/<row_id>
for the frontend to poll. The Docker path benefits too — the user
isn't held on a hanging request if their host is slow.

3. CTFd 3.8 compatibility

CTFd 3.8 swapped the challenge modal from Jinja+jQuery to a hard-coded
Alpine.js template that doesn't extend plugin templates. This PR makes
the plugin work on 3.8 without breaking older 3.x:

  • view.js no longer dereferences CTFd.lib.markdown at load time
    (it's undefined on 3.8). The renderer is initialized lazily inside
    preRender, with fallbacks for both the legacy callable shape and
    the newer object shape, plus a plain-text-escape fallback if
    neither is present.
  • The challenge view UI (Get Connection Info / Reset / Stop / Add
    Time buttons + result panel) is injected from JS in postRender,
    anchored off the stable .challenge-desc element, instead of
    relying on {% block connection_info %} in view.html (the block
    hook no longer fires).
  • State-only PATCHes no longer 500 when decay is 0
    (caused ZeroDivisionError in calculate_value).
  • Asset URLs include a content-hash query string (?v=<sha256-prefix>)
    so browsers pick up new JS/CSS without manual cache busting.

4. Per-challenge state restore on modal reopen

Closing and reopening a challenge modal used to always show the
Request button, and clicking it would spawn a new container even
when the player already had one running. Added
GET /containers/api/running/<chal_id> which returns the player's
current container state (running / provisioning / none) without
spawning anything, and the view JS uses it to render the right UI on
modal open.

5. Tear down container on solve

When a player submits a correct flag for a container challenge, the
plugin now kills their ACI/Docker container and clears the row in
solve(). Saves on idle compute and stops the now-pointless URL from
sitting open.

6. Smaller fixes

  • Admin containers dashboard badge reflects the active backend
    ("Docker Connected" vs "Azure Container Instances Connected").
  • route_containers_dashboard no longer references an undefined
    settings name (was a 500).
  • README has a Docker-Compose deployment section matching the layout
    most operators end up with (Caddy + MariaDB + Redis + bind-mounted
    plugin).

Heads-up: per-user mode

Earlier in this branch, ContainerInfoModel was switched from
team_id to user_id so per-player ACI containers were addressable
in user-mode CTFs (which the project I was running used). If you'd prefer
to keep team-mode as the default and add user-mode as a setting, I'm
happy to refactor that — it's an isolated change touching
ContainerInfoModel, the four player routes, and one dashboard column.


Testing

End-to-end verified against a live CTFd 3.8.4 deployment on Azure
(CTFd in Docker on an Azure VM, ACI in westus2, ACR with UAMI auth):

  • ✅ Provision a whoami SSH challenge → ACI container running with
    per-user DNS hostname.
  • ✅ Provision a view-source web challenge (nginx:alpine +
    static HTML) → reachable, flag in source.
  • ✅ Reset / Add Time / Stop all behave correctly.
  • ✅ State persists across modal close+reopen.
  • ✅ Correct flag submission tears down the container.
  • ✅ Slug helper validated against 14 username edge cases (accents,
    unicode, all-symbol, oversized, mixed case, leading/trailing
    punctuation).
  • ✅ Admin containers dashboard loads with the right backend label.

Settings additions (admin → containers → Settings)

Key                       Purpose
backend                   docker (default) or aci
azure_resource_group      RG name (ACI lands here)
azure_region              e.g. westus2
azure_uami_resource_id    full ARM ID of the UAMI used for ACR pulls
acr_login_server          e.g. <registry>.azurecr.io
azure_dns_label_prefix    base for per-user hostnames

A new requirements.txt adds:

azure-identity
azure-mgmt-containerinstance
azure-containerregistry

These are only imported lazily when backend == "aci", so they're
optional dependencies for Docker-only deployments.
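
A minimal sketch of that lazy-import pattern, assuming a hypothetical helper named load_azure_sdk (the plugin's actual import site may look different). The three module names are the import names of the packages listed above; the importer parameter is an assumption added only so the behavior can be exercised without the SDK installed.

```python
import importlib

# Import names for azure-identity, azure-mgmt-containerinstance,
# and azure-containerregistry.
AZURE_MODULES = (
    "azure.identity",
    "azure.mgmt.containerinstance",
    "azure.containerregistry",
)


def load_azure_sdk(backend, importer=importlib.import_module):
    """Import the Azure SDK only when the ACI backend is selected.

    `load_azure_sdk` and `importer` are hypothetical names for this sketch.
    """
    if backend != "aci":
        return {}  # Docker-only deployments never touch the Azure packages
    loaded = {}
    for name in AZURE_MODULES:
        try:
            loaded[name] = importer(name)
        except ImportError as exc:
            raise RuntimeError(
                f"backend 'aci' requires the Azure SDK; could not import {name}"
            ) from exc
    return loaded
```

With backend set to docker the function returns immediately, so a missing Azure package only becomes an error once an admin actually selects aci.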


Happy to split this into separate PRs (ACI backend / 3.8 compat /
fixes) if you'd prefer that for review.

therealcybermattlee and others added 23 commits May 14, 2026 16:51
Containers are now tracked per-user instead of per-team: ContainerInfoModel
swaps team_id for user_id with an FK to users.id, the four player routes
drop the team membership check and pass user.id, and the admin dashboard
relabels the column accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces a second container backend selectable from the admin
Settings page. Choosing "Azure Container Instances" provisions a
container group per challenge attempt via azure-mgmt-containerinstance,
authenticates to Azure with DefaultAzureCredential, and attaches a
user-assigned managed identity to each spawned group so it can pull
private images from ACR without storing registry credentials.

Schema gains a nullable `hostname` column on ContainerInfoModel so
each ACI container's per-group DNS name can be returned to the player;
the Docker backend still falls back to the global docker_hostname
setting. Settings page is restructured around a backend selector with
show/hide between Docker and Azure fieldsets.

Also includes a handful of defensive fixes the design review flagged:
the manager exposes a shutdown() method that the settings route now
calls before swapping backends (was leaking BackgroundScheduler
instances on every save); the ACI poller has an explicit 300s timeout
and 5s polling interval; all previously silent except blocks log to
stdout; the expired-container sweep commits once per pass instead of
per row; and the admin kill route no longer crashes when the DB row
has already been removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The synchronous request flow couldn't survive a 30-60 second ACI
provision through CTFd's stock Gunicorn (30s default timeout) and
nginx (60s default proxy_read_timeout), let alone Cloudflare's 100s
edge limit. POST /containers/api/request now writes a placeholder row
with status="provisioning", spawns a daemon thread to call the backend,
and returns HTTP 202 with the row id. A new GET /api/status/<id> lets
the frontend poll every 3 seconds (up to 3 minutes) until the row
flips to running or failed.

ContainerInfoModel switches to an auto-increment id primary key so
the row exists before the backend has assigned a container_id;
container_id, port, and hostname are nullable until provisioning
completes; status and error_message are new columns. A unique
constraint on (challenge_id, user_id) closes the TOCTOU window —
concurrent requests catch IntegrityError and return the winning row.
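
The constraint-plus-IntegrityError pattern can be sketched roughly as follows. This uses the standard-library sqlite3 module as a stand-in for CTFd's SQLAlchemy session, and the table name container_info and helper get_or_create_row are hypothetical, so treat it as an illustration of the idea rather than the plugin's code.

```python
import sqlite3

# Hypothetical stand-in for ContainerInfoModel's table.
SCHEMA = """
CREATE TABLE container_info (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    challenge_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL,
    status TEXT NOT NULL,
    UNIQUE (challenge_id, user_id)
)
"""


def get_or_create_row(conn, challenge_id, user_id):
    """Return (row_id, created); a losing insert gets the winner's row id."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO container_info (challenge_id, user_id, status) "
                "VALUES (?, ?, 'provisioning')",
                (challenge_id, user_id),
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Another request already inserted this (challenge, user) pair.
        row = conn.execute(
            "SELECT id FROM container_info "
            "WHERE challenge_id = ? AND user_id = ?",
            (challenge_id, user_id),
        ).fetchone()
        return row[0], False
```

Because the database enforces uniqueness, there is no gap between "check for an existing row" and "insert a new one" for two requests to slip through.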

If a user stops or resets while provisioning is still in flight, the
worker re-fetches the row after create succeeds; finding it gone, it
kills the just-created container so we don't leak (and pay for) an
orphan ACI group.
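
A rough sketch of the worker flow described above, using an in-memory dict in place of ContainerInfoModel. The names rows, provision, and request_container are hypothetical, and the zero-argument create_container call is an assumption; only the method names come from the manager API.

```python
import threading

rows = {}  # hypothetical stand-in for ContainerInfoModel rows, keyed by id
rows_lock = threading.Lock()


def provision(row_id, backend):
    """Background worker: create the container, then re-check the row."""
    try:
        container_id = backend.create_container()
    except Exception as exc:
        with rows_lock:
            row = rows.get(row_id)
            if row is not None:
                row.update(status="failed", error_message=str(exc))
        return
    with rows_lock:
        row = rows.get(row_id)  # re-fetch: the user may have stopped or reset
        if row is not None:
            row.update(status="running", container_id=container_id)
            return
    # Row is gone, so nobody is waiting for this container: do not leak it.
    backend.kill_container(container_id)


def request_container(row_id, backend):
    """Write the placeholder row, start the worker, and answer with 202."""
    with rows_lock:
        rows[row_id] = {
            "status": "provisioning",
            "container_id": None,
            "error_message": None,
        }
    worker = threading.Thread(target=provision, args=(row_id, backend), daemon=True)
    worker.start()
    # The thread is returned only so callers of this sketch can join() it.
    return 202, row_id, worker
```

The status poll then only has to read the row, which is why the request thread never waits on Azure.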

Admin /api/kill, /api/purge, /api/stop, and the admin dashboard
template now handle rows where container_id is still None (because
the backend hasn't been called yet), and the dashboard adds a Status
column showing provisioning / running / failed.

The view.js request and reset handlers are rewritten in fetch with a
polling loop; renew and stop stay XHR since they're fast operations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clamp container memory to [0.1, 16] GB and CPU to [0.1, 4.0] cores so
an admin can't submit a spec ACI will reject at provision time. Cap
get_images at 20 tags per repo and order by last-updated descending,
so a large ACR doesn't hang the create-challenge dropdown while it
enumerates every tag. Pass UserAssignedIdentities() rather than {}
to ContainerGroupIdentity for SDK type-correctness; the wire payload
is identical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walks through clone, custom Dockerfile that pip-installs the Azure
SDK, compose.yaml wiring with service-principal env vars, Caddy as
the TLS front-end, schema-drop instructions for upgraders, and an
end-to-end verification step. Matches the common self-hosted layout
where compose.yaml, CTFd/, and caddy/ sit side-by-side under a single
parent directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the compose.yaml example to mirror the typical ~/ctfd-stack
configuration: ctfd/ctfd:3.8.4 image swap to build:, env vars carried
through (REVERSE_PROXY, WORKERS, DATABASE_URL pattern), .ctfd_secret_key
file mount with a pre-create step so Docker doesn't create it as a
directory. Calls out the ctfd_frontend + ctfd_backend dual-attachment
required for Azure egress and DB access. Notes Caddy with Cloudflare
DNS-01 (caddy-cloudflare image) needs no proxy-timeout changes thanks
to the async refactor.

Adds a managed-identity vs service-principal split for Azure auth, with
shell commands for assigning a system-assigned MI to the CTFd VM and
granting Contributor + AcrPull on the appropriate scopes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The badge always read "Docker Connected/Not Connected" regardless of
backend. When ACI is the active backend, label it accordingly so admins
can tell at a glance which provisioner is wired up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
State-only PATCHes (e.g. Finish -> set visible) re-run calculate_value
on the existing challenge. If decay is 0, the curve math hits a divide
by zero and returns 500, leaving the Options modal stuck open. Treat
decay=0 as a static challenge worth `initial` instead of throwing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`CTFd.lib.markdown()` is undefined at script-parse time on CTFd 3.8.x,
so reading it at module top-level threw a TypeError that aborted the
rest of view.js. Without preRender/render/postRender registered, the
challenge modal failed to open. Initialize the renderer lazily inside
preRender (and on first render() as a fallback).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Append ?v=<sha256-prefix> to create.js/update.js/view.js URLs so
browsers fetch a fresh copy whenever the file content changes. The
hash is computed once at module load, so a container restart picks
up new asset content automatically.
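
One way such a helper could look, as a sketch only: the function name versioned_url and the 8-character prefix length are assumptions, and it takes the file content as bytes so it can run without touching disk.

```python
import hashlib


def versioned_url(url_path, content, prefix_len=8):
    """Append ?v=<sha256-prefix> derived from the asset's bytes.

    `versioned_url` and `prefix_len` are hypothetical; the source only says
    that a SHA-256 prefix is appended as a `v` query parameter.
    """
    digest = hashlib.sha256(content).hexdigest()[:prefix_len]
    return f"{url_path}?v={digest}"
```

Any change to the bytes changes the digest, so the URL changes exactly when the asset does.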

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CTFd 3.8.x ships with CTFd.lib only exposing {$, dayjs}; the markdown
helper isn't there. Detect both legacy (CTFd.lib.markdown() returns a
renderer) and newer (CTFd.lib.markdown is the renderer) shapes, and
fall back to escaped plain text if neither exists so the modal still
opens.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CTFd 3.8.x renders the challenge modal server-side using Alpine.js, and
the {% block connection_info %} override in view.html no longer applies
(block name/structure changed in newer CTFd). Inject the Request
Connection Info button (and reset/stop/renew controls) via JS in
postRender by anchoring off the stable .challenge-desc element. This
removes our dependency on CTFd's internal template structure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ACI rejects memory_in_gb that isn't a multiple of 0.1 (e.g. 1500 MB /
1024 = 1.46484375 GB returned MemoryRequirementNotTimesOfOneTenthGB).
Round memory to the nearest 0.1 GB and cpu to the nearest 0.01 cores
so user-provided MB/CPU settings always satisfy Azure's precision.
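
A small sketch of the conversion, combining this rounding with the [0.1, 16] GB and [0.1, 4.0] core clamps from the earlier commit. The function names aci_memory_gb and aci_cpu are hypothetical.

```python
def aci_memory_gb(memory_mb):
    """Convert an MB setting to a GB value ACI accepts (multiple of 0.1)."""
    gb = memory_mb / 1024
    gb = min(max(gb, 0.1), 16.0)  # clamp to the supported range first
    return round(gb, 1)


def aci_cpu(cores):
    """Clamp CPU to [0.1, 4.0] cores and round to 0.01 precision."""
    return round(min(max(cores, 0.1), 4.0), 2)
```

For the failing case above, aci_memory_gb(1500) turns 1.46484375 into 1.5.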

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously, closing and reopening the challenge modal would always show
the Request Connection Info button, and clicking it would spawn a NEW
container even if the user already had one running for that challenge.

Add a GET /api/running/<chal_id> endpoint that returns the user's
existing container state (running, provisioning, or none) without
spawning anything, and have view.js call it on modal open so the UI
reflects the actual backend state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related improvements:

1) When a player submits a correct flag, kill their ACI container and
   delete the ContainerInfoModel row. The container has served its
   purpose; keeping it running costs money and leaks compute.

2) When provisioning, derive the ACI container-group name from the
   user's name instead of pure randomness. URLs go from
   beyondctf-41769317.<region>.azurecontainer.io to
   beyondctf-claude-a3b8.<region>.azurecontainer.io — easier to debug
   in Azure portal, and players see their handle. A 4-char random
   suffix is retained so per-user-per-challenge resets don't collide
   on the brief window where the old group is being deleted.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
route_containers_dashboard referenced a bare `settings` name that
doesn't exist in its scope, causing every visit to the admin
containers dashboard to 500. Use container_manager.settings, which is
the live dict already available in the closure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous regex sanitizer left odd output for usernames with
accents (héllo -> h-llo), pure-symbol names ('!!!' -> empty), or
non-ASCII scripts (日本語 -> empty -> invalid Azure name). Replace
it with a small NFKD-based slugifier that strips diacritics,
collapses non-alphanumerics into single hyphens, caps length, and
falls back to 'user' for anything that slugs to empty. Validated
against 14 edge cases (accents, mixed case, oversized, unicode-only,
all-symbol, leading/trailing punctuation, etc.).
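
A minimal sketch of an NFKD-based slugifier along those lines. The names slugify_username and dns_label, the 20-character cap, and the use of hex characters for the suffix are assumptions; the source only states that length is capped and that the suffix is 4 random characters.

```python
import re
import secrets
import unicodedata


def slugify_username(name, max_len=20):
    """Reduce a username to lowercase ASCII letters, digits, and hyphens."""
    # NFKD splits accented letters into base letter + combining mark;
    # the ASCII encode then drops the marks and any non-Latin script.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumerics into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")
    # Cap the length, then strip again in case the cut landed on a hyphen.
    slug = slug[:max_len].strip("-")
    return slug or "user"


def dns_label(prefix, username):
    """Build <prefix>-<slug>-<4 random hex chars>."""
    suffix = secrets.token_hex(2)  # 2 random bytes rendered as 4 hex chars
    return f"{prefix}-{slugify_username(username)}-{suffix}"
```

With this version, héllo slugs to hello, while '!!!' and 日本語 both fall back to user.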

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Container challenges previously used a single global CPU/RAM setting
(default 1 vCPU / 1.5 GB), so intensive challenges like an LLM couldn't
get more without bumping every container.

Add a per-challenge Size tier (small/medium/large) selectable in the
admin create/edit form. Each tier maps to fixed CPU/memory values that
flow through the spawn path into both the ACI and Docker backends; the
global setting remains the fallback for the unset path. "small" equals
the historical default, so existing challenges are unchanged.
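
The tier lookup could be as simple as the sketch below. Only the small values (1 vCPU / 1.5 GB) come from the text above; the medium and large numbers and the name resources_for are placeholders invented for illustration.

```python
# (cpu cores, memory GB) per tier. Only "small" is taken from the commit
# message; "medium" and "large" are hypothetical placeholder values.
SIZE_TIERS = {
    "small": (1.0, 1.5),
    "medium": (2.0, 4.0),
    "large": (4.0, 8.0),
}


def resources_for(size, global_cpu, global_memory_gb):
    """Return (cpu, memory_gb) for a tier, or the global setting if unset."""
    return SIZE_TIERS.get(size, (global_cpu, global_memory_gb))
```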

A guarded ALTER TABLE in load() adds the new `size` column on upgrade
(create_all never alters existing tables) and backfills rows to 'small'.
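
A sketch of what a guarded column add can look like, written against the standard-library sqlite3 module so it runs on its own. The real deployment uses CTFd's database layer, and the table name container_challenge and function ensure_size_column are assumptions; the column check would differ on MariaDB.

```python
import sqlite3


def ensure_size_column(conn, table="container_challenge"):
    """Add the `size` column if it is missing and backfill to 'small'.

    Returns True when the column was added, False when it already existed.
    """
    # PRAGMA table_info yields one row per column; index 1 is the name.
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if "size" in columns:
        return False  # already upgraded, nothing to do
    with conn:  # commit both statements together
        conn.execute(f"ALTER TABLE {table} ADD COLUMN size TEXT")
        conn.execute(f"UPDATE {table} SET size = 'small' WHERE size IS NULL")
    return True
```

The check makes the upgrade idempotent, so running it on every plugin load is harmless.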

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Image <select> was rendered with `disabled` and only re-enabled on the
success branch of the /containers/api/images XHR. If that endpoint errored,
returned non-JSON (e.g. an HTML 5xx page), or the network call failed,
update.js threw on JSON.parse before reaching removeAttribute("disabled").
Disabled <select> elements don't submit, so editing any field on an
existing challenge silently sent an empty `image` and broke the save.

Render the saved image as a pre-selected <option> directly in update.html
and drop the `disabled` attribute, so the form is submittable from page
load regardless of what the registry endpoint does. Update.js now appends
additional registry tags when the XHR succeeds (deduping against the
pre-rendered option) and surfaces a useful status message on failure
instead of leaving the field stuck on "Loading...". Create.js gets the
same error-surfacing treatment so authors can see why the dropdown isn't
populating.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CTFd's challenge-type loader URL-encodes the values it reads from the
ContainerChallenge.scripts dict before requesting them, so a value of
"/plugins/containers/assets/update.js?v=Da4df62fd" came back to the
server as "/plugins/containers/assets/update.js%3Fv%3DDa4df62fd". The
`?` (`%3F`) and `=` (`%3D`) became part of the path, Flask's static
route returned 404, the JS never ran, and the Image dropdown sat
permanently on its "Loading..." placeholder — which also disabled
edits to every other field on the challenge form.
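
The effect is easy to reproduce with Python's urllib.parse, used here only as a stand-in for whatever encoder CTFd's loader actually applies (that mapping is an assumption; the input and output strings are the ones quoted above).

```python
from urllib.parse import quote, urlsplit

versioned = "/plugins/containers/assets/update.js?v=Da4df62fd"

# quote() leaves "/" alone by default but percent-encodes "?" and "=".
encoded = quote(versioned)
print(encoded)

# With the "?" encoded there is no query string left to split off, so the
# server sees the whole string as a path.
parts = urlsplit(encoded)
print(repr(parts.path), repr(parts.query))
```

The original URL splits into a path plus the query v=Da4df62fd; the encoded one has an empty query and a path that matches no file on disk.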

Drop the _versioned() helper and use plain paths. Admins hard-refresh
after a plugin update if their browser cached a stale copy, which they
already need to do for HTML/template changes anyway.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CTFd 3.8.x replaced per-plugin scoring fields with a unified Scoring
Function selector (Static / Linear / Logarithmic). On Static, the new
admin form submits `value` directly and does not expose a way to edit
`initial` — the legacy column we relied on as the "static value" sits
unchanged at whatever it was at challenge creation.

Our calculate_value() ran on every update and solve, and when decay was
0 (Static) it overwrote challenge.value with challenge.initial. Result:
admins changed Current Value from 150 to 200, the form sent value=200,
the PATCH response came back value=150 — the new value was wiped out
in the same request that saved it. The challenge tile on the player
board showed the stale 150 too, of course.

Drop the overwrite. For Static (decay falsy) just return the challenge
as-is so whatever the form set persists. Dynamic scoring (decay > 0)
keeps the existing logarithmic recalculation against solve count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous commit removed the value=initial clobber in the static branch
of calculate_value(), but it also dropped the only db.session.commit()
on that path. update() relies on calculate_value() to commit, so the
setattr() changes lived in the session only until the request ended
and got rolled back. PATCH response showed the new value (in-memory
challenge object) but a reload of the form showed the pre-edit value
because nothing had actually hit the DB.

Add the commit back on the Static early-return. Dynamic scoring path
still commits at its existing line, so nothing else changes.
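
The resulting control flow, sketched under assumptions: session stands in for db.session, and dynamic_curve stands in for the plugin's existing decay math, which is not reproduced here. Both parameters are hypothetical and exist so the sketch can run on its own.

```python
def calculate_value(challenge, solve_count, session, dynamic_curve):
    """Sketch of the Static early-return with the commit restored."""
    if not challenge.decay:
        # Static scoring: keep whatever value the admin form set, but still
        # commit so the change survives the end of the request.
        session.commit()
        return challenge
    # Dynamic scoring: recalculate from the solve count, then commit.
    challenge.value = dynamic_curve(challenge, solve_count)
    session.commit()
    return challenge
```

Both branches now end in a commit, and the Static branch never touches challenge.value or divides by decay.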

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CTFd 3.8.x's base admin/challenges/{create,update}.html provides its
own unified Scoring Function selector (Static / Linear / Logarithmic)
and reveals the matching Initial / Decay / Minimum inputs when the
admin picks a non-Static function. Our templates were rendering their
own copies too, which CTFd's JS hid via display:none and stripped of
their `name` attribute so they never submitted — pure DOM noise that
showed up as duplicated `.chal-initial` / `.chal-decay` / `.chal-minimum`
elements in DevTools and confused diagnostics.

Remove the three form-groups from both files. Current Value on the
update form stays: it's the only `name="value"` input on the page
and the only one that actually submits, since CTFd's base template
delegates that field to the plugin via the `{% block value %}` override.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>