feat: Node.js plugin isolation via workerd sandbox#426
Conversation
🦋 Changeset detectedLatest commit: 4f15fc1 The changes in this PR will be included in the next version bump. This PR includes changesets to release 14 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Scope checkThis PR changes 3,719 lines across 29 files. Large PRs are harder to review and more likely to be closed without review. If this scope is intentional, no action needed. A maintainer will review it. If not, please consider splitting this into smaller PRs. See CONTRIBUTING.md for contribution guidelines. |
There was a problem hiding this comment.
Pull request overview
Adds a new Node.js sandbox runner based on workerd (plus a dev runner using Miniflare) to bring Cloudflare-like plugin isolation/capability enforcement to non-Workers deployments, along with core interface updates (SandboxRunner.isHealthy(), SandboxUnavailableError) and accompanying docs/tests.
Changes:
- Introduces new
@emdash-cms/workerdpackage implementing theSandboxRunnercontract (workerd sidecar + backing service; Miniflare dev runner). - Extends core sandbox APIs (new
isHealthy()+SandboxUnavailableError) and wires an opt-outsandbox?: booleanintegration flag. - Updates docs and adds a new conformance/integration test suite for bridge behavior and isolation.
Reviewed changes
Copilot reviewed 28 out of 29 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| pnpm-workspace.yaml | Adds @types/better-sqlite3 to the shared catalog. |
| pnpm-lock.yaml | Locks new workspace package/deps and catalog specifier updates. |
| packages/workerd/tsconfig.json | TypeScript config for the new workerd package build output. |
| packages/workerd/test/plugin-integration.test.ts | Integration tests for real plugin-like operations against the bridge handler. |
| packages/workerd/test/miniflare-isolation.test.ts | Miniflare isolation behavior tests (service bindings, KV, dynamic code, etc.). |
| packages/workerd/test/bridge-handler.test.ts | Bridge handler conformance tests (capabilities, KV/storage isolation, errors). |
| packages/workerd/src/sandbox/wrapper.ts | Generates the plugin wrapper code that proxies ctx.* via HTTP bridge calls. |
| packages/workerd/src/sandbox/runner.ts | Production workerd sidecar runner (process lifecycle, tokens, backing service). |
| packages/workerd/src/sandbox/index.ts | Public exports for the workerd sandbox surface. |
| packages/workerd/src/sandbox/dev-runner.ts | Development runner intended to use Miniflare for faster iteration. |
| packages/workerd/src/sandbox/capnp.ts | Generates workerd capnp config for per-plugin nanoservices. |
| packages/workerd/src/sandbox/bridge-handler.ts | Shared bridge dispatcher implementing capability enforcement and DB access. |
| packages/workerd/src/sandbox/backing-service.ts | Node HTTP backing service wrapper that authenticates and dispatches bridge calls. |
| packages/workerd/src/index.ts | Package entrypoint export wiring. |
| packages/workerd/package.json | Defines the new @emdash-cms/workerd package metadata and deps. |
| packages/marketplace/package.json | Switches @types/better-sqlite3 to catalog version. |
| packages/core/src/plugins/sandbox/types.ts | Adds isHealthy() to SandboxRunner and introduces SandboxUnavailableError. |
| packages/core/src/plugins/sandbox/noop.ts | Implements isHealthy() and updates the “sandbox not available” error text. |
| packages/core/src/plugins/sandbox/index.ts | Exports SandboxUnavailableError. |
| packages/core/src/plugins/index.ts | Re-exports the new sandbox error and HTTP access helpers. |
| packages/core/src/index.ts | Exposes SandboxUnavailableError and HTTP access creation helpers publicly. |
| packages/core/src/emdash-runtime.ts | Improves startup warning when sandbox runner is configured but unavailable. |
| packages/core/src/astro/integration/vite-config.ts | Adds sandbox: false escape hatch behavior to virtual module generation. |
| packages/core/src/astro/integration/runtime.ts | Adds sandbox?: boolean config option documentation/type. |
| packages/core/package.json | Switches @types/better-sqlite3 to catalog version. |
| packages/cloudflare/src/sandbox/runner.ts | Implements isHealthy() for the Cloudflare runner. |
| docs/src/content/docs/plugins/sandbox.mdx | Documents Node+workerd sandboxing and adds the escape hatch + comparison table. |
| docs/src/content/docs/plugins/creating-plugins.mdx | Adds “Testing in the Sandbox” guidance and behavior differences table. |
| .changeset/bumpy-crabs-nail.md | Changeset describing new workerd sandboxing feature and related exports. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Overlapping PRsThis PR modifies files that are also changed by other open PRs:
This may cause merge conflicts or duplicated work. A maintainer will coordinate. |
@emdash-cms/admin
@emdash-cms/auth
@emdash-cms/blocks
@emdash-cms/cloudflare
emdash
create-emdash
@emdash-cms/gutenberg-to-portable-text
@emdash-cms/x402
@emdash-cms/plugin-ai-moderation
@emdash-cms/plugin-atproto
@emdash-cms/plugin-audit-log
@emdash-cms/plugin-color
@emdash-cms/plugin-embeds
@emdash-cms/plugin-forms
@emdash-cms/plugin-webhook-notifier
commit: |
ascorbic
left a comment
There was a problem hiding this comment.
Amazing work on this! I've spent the morning going through this with Claude and it's neary there but I just have a few comments.
I think the fact there's now a bit more latency on the bridge means we should add content.createMany/updateMany/deleteMany to the API to avoid so many round-trips.
For media uploads I think it's best if we move the files out of the JSON bridge entirely, as it's not very efficient. Better to use somehting like the way it's done for public storage, where it gets a signed URL and then is uploaded directly. This could be handled inside the wrapper with a binding so the API appears the same to the plugin itself.
One thing to consider too is defaulting to using Unix sockets instead of TCP (with fallback in Windows dev).
|
Also, we definitely need some real workerd tests for this. Finally: it should be called sandbox-workerd not just workerd |
|
@BenjaminPrice it would be great to get this in! Will you be able to look at the comments, or do you want me to take it on instead? |
|
I somehow missed your previous comment. I’ll get this done.
…On Thu, Apr 30, 2026 at 19:36 Matt Kane ***@***.***> wrote:
*ascorbic* left a comment (emdash-cms/emdash#426)
<#426 (comment)>
@BenjaminPrice <https://github.com/BenjaminPrice> it would be great to
get this in! Will you be able to look at the comments, or do you want me to
take it on instead?
—
Reply to this email directly, view it on GitHub
<#426 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA6MQSEVRBTURN5EOVEH3LL4YMUDVAVCNFSM6AAAAACXTKS63OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DGNJRG4ZDSNBQGM>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
a846039 to
3812e5a
Compare
Addresses all inline feedback from ascorbic's April 18 review on PR emdash-cms#426. Fixes: - Version increment uses idiomatic Kysely object form (bridge-handler) - Blob body headers bug: array-based lookup instead of object access (wrapper) - Request body size limit (10MB) on backing service - storageQuery limit enforcement (cap at 100, matching content/media/users) - SIGTERM handler deduplicated to one per process (module-level singleton) - Windows workerd binary resolution (.exe suffix) - Explicit /__ready endpoint replaces fragile /__health polling - Dev runner enforces wall-time limits (withWallTimeLimit wrapper) - Explicit EMDASH_SANDBOX_DEV env var for dev runner selection - Eager workerd restart on plugin load/unload (50ms debounced) - Handler cache cleanup on plugin unload (keyed by pluginId) - Media upload uses base64 encoding instead of JSON number arrays - Batch content APIs: createMany/updateMany/deleteMany (max 100 items) - Unix socket plumbing for backing service (disabled pending capnp validation) - plugin:deactivate hook fires on marketplace plugin removal - Sandbox availability warning fires regardless of plugin count - Cloudflare Workers detection for sandbox:false (clear error message) - Package renamed from @emdash-cms/workerd to @emdash-cms/sandbox-workerd - Workerd integration test suite (skips if workerd unavailable)
…isolation Proves that miniflare (wrapping workerd) supports all capabilities needed for sandboxed plugin execution on the Node deployment path: - Plugin code loads from strings (no filesystem, bundles from DB/R2 work) - Service bindings between workers provide capability scoping - External service bindings route plugin calls to Node handler functions - KV namespace bindings provide per-plugin isolated storage - Plugins without bindings cannot access unavailable capabilities - Dispose/recreate cycle supports plugin install/uninstall Key finding: miniflare's serviceBindings with async Node handlers eliminates the need for a separate HTTP backing service server. The bridge calls route directly from workerd isolates to Node functions.
…p config Implements the SandboxRunner interface for Node.js deployments using workerd as a sidecar process: - WorkerdSandboxRunner: spawns workerd via child_process, manages lifecycle with epoch-based stale handle detection and health checks - Backing service: authenticated HTTP server in Node handling plugin bridge calls (content, media, KV, storage, email, users, network) - Auth: per-startup HMAC secret, per-plugin tokens encoding capabilities. Server-side capability validation on every request. - capnp config generator: creates workerd config from plugin manifests, each plugin as a nanoservice with its own port - Plugin wrapper: generates JS that runs inside workerd isolate, proxying ctx.* calls via HTTP fetch to the backing service - Wall-time enforcement via Promise.race (matching Cloudflare pattern)
Extends the SandboxRunner interface with isHealthy() for sidecar-based runners where the sandbox process can crash independently of the host. - SandboxRunner.isHealthy(): returns false when sidecar is down - SandboxUnavailableError: typed error for stale handles and unavailable sandbox - NoopSandboxRunner: implements isHealthy() (always false) - CloudflareSandboxRunner: implements isHealthy() (delegates to isAvailable) - WorkerdSandboxRunner: exponential backoff restart on crash (1s, 2s, 4s, cap 30s, give up after 5 failures in 60s), SIGTERM forwarding to child - SandboxNotAvailableError message updated to mention both Cloudflare and workerd sandbox runners (no longer Cloudflare-specific)
…F protection Replaces the naive hostname-only check in the workerd backing service with core's createHttpAccess/createUnrestrictedHttpAccess. This gives the workerd sandbox runner identical behavior to in-process plugins: - Redirect targets revalidated against allowedHosts on each hop - Credential headers stripped on cross-origin redirects - SSRF protection blocks private IPs, cloud metadata endpoints - Max 5 redirects enforced Exports createHttpAccess and createUnrestrictedHttpAccess from the emdash package so platform adapters can reuse the shared policy layer.
…y warnings Adds debugging escape hatch and clearer messaging for sandbox availability: - sandbox: false config option explicitly disables plugin sandboxing even when a sandboxRunner is configured, for isolating whether bugs are in plugin code or in the sandbox runtime - Upgrades sandbox-unavailable log from console.debug to console.warn with actionable message mentioning workerd installation - SandboxNotAvailableError message now references both @emdash-cms/cloudflare/sandbox and @emdash-cms/workerd/sandbox as options
|
@BenjaminPrice this is really close! It would be great if we could address the last few points and get it out |
Sorry, work has been quite busy. I'll try to get to this during this week. |
… fetch The URLSearchParams branch of marshalRequestInit was using object-bracket header assignment on out.headers, but out.headers is an array of [name, value] pairs everywhere else in the function. The bracket assignment put a non-numeric property on the array — invisible to the bridge handler's unmarshalRequestInit (which iterates pairs) and dropped by JSON.stringify during transport. Servers received URLSearchParams POSTs without a Content-Type and rejected them as non-form-encoded. Switch to the same array-shape pattern used by the Blob branch above.
Math.min(Number(opts.limit) || 50, 100) accepted negative numbers: a
plugin sending { limit: -5 } produced query.limit(-4), which SQLite
silently treats as zero rows and PostgreSQL rejects. Apply the same
Math.max(1, ...) floor used in userList to contentList, mediaList, and
storageQuery, preserving storageQuery's undefined-passthrough so callers
can still omit the limit field.
waitForReady() only probed plugins.values().next().value. With N plugins workerd opens N sockets and any one of them can fail to bind (port collision, import-time error). If plugin 1 bound but plugin 2 didn't, the runner reported ready and the next invokeHook on plugin 2 hung to the per-call timeout or hit ECONNREFUSED. Extract a probeAllReady() helper that fetches /__ready on every plugin in parallel and returns true only when every probe is ok. waitForReady() calls it each tick. Helper is module-exported for tests but not added to the package barrel.
contentCreateMany / contentUpdateMany / contentDeleteMany looped with await across per-item helpers. If item 50 of 100 threw, the first 49 were already committed, the plugin saw a generic error, and there was no way to know which prefix had landed. A plugin author calling a verb named "createMany" expects atomicity. Wrap each many-op body in db.transaction().execute(trx => ...) so a mid-batch failure rolls the whole batch back. MAX_BATCH_SIZE guard stays outside the transaction.
The exit handler registered inside restart() captured this and would unconditionally mutate this.workerdProcess and this.healthy whenever it fired. If a late exit from a prior process landed after a newer workerd had already been assigned to this.workerdProcess, the stale handler nulled out the live process's handle and marked it unhealthy. Capture proc into a local immediately after spawn and route the exit listener through makeWorkerdExitHandler, which checks host.workerdProcess !== proc and bails before touching shared state. Helper is module-exported for unit testing but not added to the package barrel.
scheduleEagerStart used `void this.ensureRunning()`, which swallowed startup rejections (ENOENT on the workerd binary, capnp parse errors, waitForReady timeouts). On Node 22+ those rejections become unhandledRejection, and they bypassed crashCount/scheduleRestart so the runner stayed silently down with no automatic retry. Attach a .catch that logs the error and calls scheduleRestart, which already handles backoff and the 5-failure-window cap.
stopWorkerd's SIGTERM-then-SIGKILL fallback timer was never cleared on graceful exit. The `if (!exited)` guard prevented an actual signal from being sent, but the pending timer itself kept the Node event loop alive for up to 5s past termination -- so terminateAll() and shutdown stalled. Extract the kill-with-escalation dance into waitForProcessExit() and clear the timer inside the exit listener. Switch the escalation check to proc.exitCode === null, which matches actual process state more precisely than the exited flag.
nextPluginPort climbed monotonically and unloadPlugin never returned the port to the pool. Long-running sites with frequent marketplace install/uninstall (or dev watcher reloads) eventually walked toward the top of the ephemeral range, raising the odds of collision with other listeners on the host. Add a freePorts stack; load() pops from it before incrementing nextPluginPort, unloadPlugin() pushes the freed port back. This does not handle pre-existing host-port collisions (no EADDRINUSE probe yet), but it prevents the leak the bot flagged.
The map was written on bridge-handler insertion and iterated in removePlugin, but no read path ever consulted it -- the cache was keyed by claims.pluginId, not by token. Pure dead code that grew unbounded under plugin churn.
The runner spawned workerd with env: { ...process.env }, handing every
parent env var to a child that runs untrusted plugin code. Empirical
testing showed workerd 1.x's nodejs_compat polyfill returns an empty
process.env to plugins regardless of the child's actual environment,
so today this isn't an active leak -- but that behavior is an
undocumented implementation detail that a future workerd release could
change without warning.
Introduce minimalWorkerdEnv() that allowlists only what workerd needs
to start (PATH, HOME, TMPDIR/TMP/TEMP, LANG, LC_ALL). Operators who
need extra vars can list them in EMDASH_WORKERD_PASSTHROUGH_ENV. The
integration test suite still spawns a real workerd and confirms it
boots under the new minimal env.
Brings 138 commits from upstream main onto the branch so it merges
cleanly into the PR base. Conflicts resolved:
- packages/core/src/emdash-runtime.ts: upstream refactored
syncMarketplacePlugins into a generic syncSandboxedSourcePlugins
(also adding syncRegistryPlugins). Kept that structure while
preserving the sandbox-bypass branch we added in this PR.
Also propagated the SandboxedPlugin -> SandboxedPluginInstance
rename through loadSandboxedPlugins' signature.
- pnpm-lock.yaml: regenerated via pnpm install on top of upstream's
lockfile so workerd's deps (workerd, ulidx, miniflare,
better-sqlite3, kysely, etc.) are present.
- packages/workerd/src/sandbox/{runner,dev-runner}.ts: imported the
renamed SandboxedPluginInstance type and updated return types /
implements clauses to match.
All 73 workerd tests pass after the merge.
ascorbic
left a comment
There was a problem hiding this comment.
Amazing work! Great to get this in
|
Great work guys! Someone still has to press the merge button I guess? |
What does this PR do?
Adds workerd-based plugin sandboxing for Node.js deployments, closing EmDash's most significant architectural gap: plugins that run sandboxed on Cloudflare Workers now run sandboxed on Node.js too, with matching capability enforcement and isolation guarantees.
Discussion: #425
How it works
A new
@emdash-cms/workerdpackage implements theSandboxRunnerinterface (same contract as@emdash-cms/cloudflare). In production it spawns workerd as a child process with a generated capnp config. In development it uses miniflare for faster startup. Plugin workers run in V8 isolates and communicate with Node via an authenticated HTTP backing service.Architecture:
WorkerdSandboxRunner(production): spawns workerd, generates capnp config per plugin, manages lifecycle with exponential backoff restart, SIGTERM forwarding, intentional-stop guard, and epoch-based stale handle detectionMiniflareDevRunner(dev): uses miniflare'soutboundServiceto route bridge calls to Node handler functions; non-bridge fetches are blocked to prevent capability bypassbridge-handler.ts: shared bridge dispatch logic used by both runners, with capability enforcement matching the Cloudflare PluginBridgeSecurity model matches Cloudflare:
timingSafeEqualin Node, hand-rolled XOR in workerd where Web Crypto doesn't expose it)createHttpAccess(redirect revalidation on each hop, credential stripping on cross-origin redirects, SSRF blocking, max 5 redirects)write:contentdoes NOT implyread:content(matches Cloudflare bridge)fetch()from plugin code is blocked in dev (miniflareoutboundServicereturns 403) and routed through the backing service in prod (capnpglobalOutbound), forcing all network access throughctx.http.fetchand capability/host checksHonest about resource limits:
cpuMs,memoryMb, andsubrequestsare Cloudflare platform features, not standalone workerd. OnlywallTimeMsis enforced (viaPromise.race). Operators get a startup warning if they configure unenforced limits, and the docs include a caution box.Sandbox bypass mode (
sandbox: false): Debugging escape hatch that loads sandboxed plugin entries in-process viaadaptSandboxEntry+ data URLimport(). Build-time and marketplace plugins both work under bypass; marketplace install/update routes skip theSANDBOX_NOT_AVAILABLEgate when bypassed and rebuild the hook pipeline so changes take effect immediately.Type of change
Checklist
pnpm typecheckpassespnpm --silent lint:json | jq '.diagnostics | length'returns 0 errorspnpm testpasses (or targeted tests for my change)pnpm formathas been runAI-generated code disclosure
Tests
36 new tests in
@emdash-cms/workerdacross 3 files, plus 2163 existing core tests passing:miniflare-isolation.test.ts(6 tests): V8 isolate isolation, service bindings, KV namespaces, dynamic code loading, worker reconfigurationbridge-handler.test.ts(19 tests): KV CRUD + isolation, strict capability enforcement (read/write content do not imply each other;network:fetch:anysatisfiesnetwork:fetch), storage collection validation, error message parity with Cloudflare bridgeplugin-integration.test.ts(11 tests): Real plugin operations modeled after thesandboxed-testplugin (KV round-trip, storage round-trip, content lifecycle with ULID/versioning/soft-delete, cross-plugin isolation, write-only plugins cannot read)All tests use real in-memory SQLite with production schema, no mocking.
Review iterations
This PR went through ~20 rounds of automated and human review. Notable hardening from review feedback:
timingSafeEqual+ hand-rolled for workerd), miniflareoutboundServiceblocks non-bridge fetches, strict capability enforcement (write:content≠read:content)startupPromise,isHealthy()returns false when restart pending,stopWorkerd()fast-paths on already-exited and uses localexitedflag for SIGKILL fallbackctx.http.fetchreturns base64-encoded bytes preserving binary content, RequestInit marshaling preserves multi-value headers ([[name, value]]pairs), Blob/File/FormData/URLSearchParams bodies, ArrayBuffer withbyteOffset/byteLengthwindowstorageQuery/storageCounttoPluginStorageRepositorysowhere/orderBy/cursor/countwork correctly. Fixes infinite-loop pagination on shipped plugins likeforms-submissions.Storageadapter, setsstatus='ready', rolls back storage object on DB failure (best-effort cleanup); delete removes the storage object toobin/workerd(nonpxruntime download),execFileSyncfor paths with spaces, stdout/stderr drained to prevent pipe deadlockloadMarketplacePluginsBypassedruns before pipeline creation on cold start;syncMarketplacePluginsBypassedhandles runtime install/update/uninstall and rebuilds the hook pipelineChanges
New:
packages/workerd/src/sandbox/runner.tsWorkerdSandboxRunnerimplementingSandboxRunner, child process lifecycle, auth token generation, intentional-stop guard, epoch-based stale handle detectionsrc/sandbox/dev-runner.tsMiniflareDevRunnerfor development, auto-selected whenNODE_ENV === "development"src/sandbox/bridge-handler.tsPluginStorageRepositorysrc/sandbox/backing-service.tssrc/sandbox/capnp.tsglobalOutboundroutes through backing servicesrc/sandbox/wrapper.tsModified:
packages/core/plugins/sandbox/types.tsisHealthy()toSandboxRunnerinterface,SandboxUnavailableErrorclass,mediaStorageinSandboxOptionsplugins/sandbox/noop.tsisHealthy(), error message mentions both CF and workerdastro/integration/runtime.tssandbox?: booleanconfig option (escape hatch)astro/integration/virtual-modules.tssandboxBypassedflag whensandbox: falseastro/middleware.tssandboxBypassedvia namespace import (handles missing export)emdash-runtime.tsloadBypassedPlugins+loadMarketplacePluginsBypassed, runtime sync,isSandboxBypassed(), threadsmediaStorageinto sandbox runnersapi/handlers/marketplace.tshandleMarketplaceInstall/handleMarketplaceUpdateacceptsandboxBypassed, skipSANDBOX_NOT_AVAILABLEgate when setindex.tsSandboxUnavailableError,createHttpAccess,createUnrestrictedHttpAccess, repositories used by platform adaptersModified:
packages/cloudflare/sandbox/runner.tsisHealthy()(delegates toisAvailable()), passesstorageConfigto bridge bindingssandbox/bridge.tsPluginStorageRepository(parity fix)Docs
plugins/sandbox.mdxplugins/creating-plugins.mdxChangeset
.changeset/bumpy-crabs-nail.md—emdashminor,@emdash-cms/cloudflarepatch,@emdash-cms/workerdminor.